RDNA 4 - Architecture for the Modern Era


The release of the RDNA 4 architecture is a watershed moment for AMD Radeon as it represents a serious, modern, and powerful architecture that is feature rich. Today we will showcase some of its biggest improvements over its predecessors.

If you want more context on what previous architectures achieved, please check our previous articles covering GCN, RDNA 1, RDNA 2, and RDNA 3.

 

RDNA 3.5 – a mobile powerhouse

Before we get to the super exciting RDNA 4, let’s talk briefly about its predecessor – RDNA 3.5. It is a mobile-first architecture that shipped in mobile APUs, but its changes are carried over into later AMD designs.

The biggest improvement right off the bat is to general energy efficiency. Performance per watt matters to everyone, but it is absolutely essential to mobile designs where battery life and cooling capacity are limited. With a gain of around 30% over RDNA 3, the obvious question is: how did AMD achieve this on the same node?

One of the improvements is a modification to the Vector General-Purpose Register (VGPR) instructions, allowing the GPU to know if instructions can be reused later on. This saves on code size and improves performance per clock cycle.

RDNA 3.5 also adds floating point operations to the Scalar Unit. It has limits to what it can do FP-wise because AMD could not just add support for all instructions of this type, but there will be speedups in certain workloads. Add to this an improvement to data parallel processing in which instructions can be issued far quicker and you have a de facto improvement to performance and performance per watt.

Lastly from the point of view of the architecture itself – texture sampling was improved. This honestly doesn’t matter much for high performance desktop parts but it did help the mobile design performance!

Overall, we can see that RDNA 3.5 is a smaller step for AMD but it is also an important one. These architectures are becoming very fast and sophisticated so at this point some of the bigger gains left are in these small low-level improvements to how the compiler works or code is handled. Smaller stuff, but remember – several small improvements can add up to a rather large one!

 

RDNA 4 – a giant leap in AI and Ray Tracing!

Seen in: Navi 48, Navi 44

RX 9070 XT, 9070, RX 9060 XT, and others

 

The monster itself - the NITRO+ RX 9070 XT from Sapphire!

Finally – the real star of today’s article – the architecture fueling AMD’s current generation of graphics! Before we begin discussing it, do note that the RX 9070 XT based on the Navi 48 die will be compared chiefly to the RX 7800 XT (and 6900 XT). Make no mistake, it hangs in with the 7900 XTX, but it is in a sense a successor to the Navi 32 and 22 dies. This is also why it is a monolithic chip – on this mature node it was easier to just stick to monolithic designs than to try and make it chiplet based for a GPU of this size.

Speaking of process – N4P is a refined version of the 5nm process that powered most of the previous generation. It has higher transistor density and performance – Navi 48 GPUs like the 9070 XT can reach into the 3.2 GHz range with relative ease and Navi 44 GPUs can dip their toes into the 3.3 GHz range. This speed increase over the already fast RDNA 3 predecessors helps in all workloads – from compute and general rasterization to ray tracing and path tracing!

 

RDNA 4 – at a glance!

Feeding the beast is important. The RX 9070 XT hangs in the same tier of performance as the bandwidth monster that is the RX 7900 XTX. That big Navi 31 GPU has a very large last-level cache (Infinity Cache) and far more memory bandwidth. How does the RX 9070 XT keep up? Also, how is it so much faster than the RX 7800 XT?

RDNA 4 uses very fast GDDR6, but this time the gain over its predecessor, the RX 7800 XT, is relatively small: just a few percent, unfortunately. AMD could have gone for GDDR7 memory but decided against it. That would have increased the cost of production and probably lowered the initial production runs too, making the GPUs more expensive. It would have been cool to have even more bandwidth and performance, but the tradeoff was smart for RDNA 4.
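
To put "just a few percent" into numbers, here is a minimal sketch of the usual peak-bandwidth formula (bus width in bytes times per-pin data rate). The bus widths and data rates are not from this article; they are assumed from the public spec sheets (256-bit at 20 Gbps for the RX 9070 XT, 256-bit at 19.5 Gbps for the RX 7800 XT), so treat them as illustrative.

```python
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak DRAM bandwidth in GB/s: bus width in bytes times per-pin rate in Gbps."""
    return bus_width_bits / 8 * data_rate_gbps


# Assumed spec-sheet values, not taken from the article.
rx_9070_xt = peak_bandwidth_gbs(256, 20.0)
rx_7800_xt = peak_bandwidth_gbs(256, 19.5)
gain_percent = (rx_9070_xt / rx_7800_xt - 1) * 100

print(f"RX 9070 XT: {rx_9070_xt:.0f} GB/s")
print(f"RX 7800 XT: {rx_7800_xt:.0f} GB/s")
print(f"Gain: {gain_percent:.1f}%")
```

Under those assumptions the raw VRAM bandwidth only grows by about 2.6%, which is why the rest of the memory system has to do the heavy lifting.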

 


Instead, AMD had to get creative with the caches. The first, huge change is the doubling of the L2 cache of the GPU, from 4 MB to 8 MB, coinciding with the shader engines. This is more than the 7900 XTX has as well, and it’s faster to boot. As for the Infinity Cache, we are now on the 3rd generation. What AMD did this time is quite simple – they literally doubled the wiring connecting these caches, trading design complexity for much higher bandwidth. This, alongside higher clocks, helps make quite the difference in effective bandwidth.

To compare, the RX 6900 XT had around 2.3 TB/s of bandwidth on its monstrous Infinity Cache, and around 4.6 TB/s on its L2 cache. Even to this day this is quite decent. The RX 7900 XTX has vast bandwidth too – around 3.4 TB/s on its own 2nd generation Infinity Cache. The NITRO+ RX 9070 XT is clocking in at 10 TB/s on its L2 cache, and 4.5 TB/s on its last level Infinity Cache. Remember that it also has improvements to the command processor, which means better prediction of needed data and better cache utilization.

Even with all of that, cache size does matter. The RX 9070 XT has a smaller (compared to the 7900 XTX and 6900 XT) 64 MB last level cache. With that said it is very easy to say that it has more effective bandwidth than the RX 6900 XT and RX 7800 XT! An impressive feat for any GPU!
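
Why cache size and cache bandwidth both matter can be sketched with a toy model: if a fraction of memory traffic hits the Infinity Cache and the rest goes out to VRAM, the average bandwidth is the harmonic blend of the two. The 4.5 TB/s figure is the one quoted above; the 0.64 TB/s VRAM figure and the hit rates are assumptions picked purely for illustration, and real GPUs behave far less tidily.

```python
def effective_bandwidth_tbs(hit_rate: float, cache_tbs: float, dram_tbs: float) -> float:
    """Average bandwidth when hit_rate of the bytes come from cache, the rest from DRAM."""
    time_per_tb = hit_rate / cache_tbs + (1.0 - hit_rate) / dram_tbs
    return 1.0 / time_per_tb


INFINITY_CACHE_TBS = 4.5  # figure quoted in the article
VRAM_TBS = 0.64           # assumption: 640 GB/s of GDDR6

for hit_rate in (0.0, 0.5, 0.8):  # hypothetical hit rates
    bw = effective_bandwidth_tbs(hit_rate, INFINITY_CACHE_TBS, VRAM_TBS)
    print(f"hit rate {hit_rate:.0%}: {bw:.2f} TB/s effective")
```

A smaller cache tends to mean a lower hit rate, and in this model the effective bandwidth falls off quickly as more requests spill to VRAM.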

*It seems AMD reworked the L1 cache to be a read/write buffer, so uh, I think the RDNA 4’s L2 should now be called L1? And the L3 is now the L2? Ahh marketing names!

 

Real-time ray tracing is the future of 3D graphics. It was first introduced in RDNA 2 and notably refined with RDNA 3. RDNA 4 is where AMD is now getting very serious. In fact, it’s here that the new architecture sees its biggest wins! In testing with ray tracing we saw the SAPPHIRE NITRO+ RX 9070 XT overcome the (still pretty powerful) SAPPHIRE NITRO+ RX 7900 XTX consistently.

So how was this achieved? How can the RX 9070 XT frequently outcompete an RX 7900 XTX in ray tracing, sometimes by a notable amount too? That is a GPU with more bandwidth and more compute units, after all.

The first major improvement is the doubled intersection engines. What AMD did here is quite simple – they literally put two of these engines in each ray accelerator within the GPU. This is a pure doubling, from 4 to 8 box tests per cycle, which speeds up bounding volume hierarchy traversal. The ray-triangle intersection rate doubles as well, from 1 to 2 tests per cycle, and the structure is no longer BVH4 but BVH8.

 

The PULSE RX 9060 XT 16G is a great 1080p Ultra card and it can do some decent ray tracing too!

Wait, what is a bounding volume hierarchy (BVH) anyway? A full explanation is outside the scope of this article, but to keep things simple – it is a data structure that helps speed up ray traversal. When a ray is cast into a scene it needs to find out where it intersects with the actual geometry, the game world itself in this case. This is where the BVH structure comes into play – it is a way to make this faster and easier. It is a major part of modern real-time ray tracing pipelines, so the faster one accomplishes this task the quicker the GPU can move onto other tasks.
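
For the curious, the idea can be sketched in a few lines of Python. This is a toy, not how the hardware or any particular engine does it: the `Node` class, the `trace` function, and the four-object scene are all made up for illustration, and the leaves stand in for real triangle tests.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    lo: tuple                        # minimum corner (x, y, z)
    hi: tuple                        # maximum corner (x, y, z)
    children: list = field(default_factory=list)
    primitive: Optional[str] = None  # set on leaves only


def ray_hits_box(origin, inv_dir, lo, hi):
    """Slab test: does the ray (t >= 0) pass through the axis-aligned box?"""
    t_near, t_far = 0.0, float("inf")
    for o, inv, l, h in zip(origin, inv_dir, lo, hi):
        t0, t1 = (l - o) * inv, (h - o) * inv
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
    return t_near <= t_far


def trace(root, origin, direction):
    """Return the leaves whose boxes the ray touches and how many boxes were tested."""
    # This sketch assumes every direction component is nonzero.
    inv_dir = tuple(1.0 / d for d in direction)
    stack, hits, boxes_tested = [root], [], 0
    while stack:
        node = stack.pop()
        boxes_tested += 1
        if not ray_hits_box(origin, inv_dir, node.lo, node.hi):
            continue  # the whole subtree is skipped
        if node.primitive is not None:
            hits.append(node.primitive)
        stack.extend(node.children)
    return hits, boxes_tested


crate = Node((0, 0, 0), (1, 1, 1), primitive="crate")
barrel = Node((2, 0, 0), (3, 1, 1), primitive="barrel")
lamp = Node((10, 10, 10), (11, 11, 11), primitive="lamp")
sign = Node((12, 10, 10), (13, 11, 11), primitive="sign")
near = Node((0, 0, 0), (3, 1, 1), children=[crate, barrel])
far = Node((10, 10, 10), (13, 11, 11), children=[lamp, sign])
root = Node((0, 0, 0), (13, 11, 11), children=[near, far])

hits, boxes_tested = trace(root, origin=(-1.0, 0.5, 0.5), direction=(1.0, 0.01, 0.01))
print(sorted(hits), boxes_tested)
```

One failed test on the "far" box rules out the lamp and the sign without ever looking at them. Scale that up to millions of triangles and the savings become enormous.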

There is another hidden bonus to these wider BVH capabilities. The larger chunks are slightly more efficient from the point of view of a modern GPU: they are less weighed down by latency while benefitting from throughput, something GPUs excel at in general. As with all things in engineering there is also a downside, namely more reliance on memory performance and more design complexity.
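
The latency-versus-throughput trade of going from BVH4 to BVH8 can be made concrete with a back-of-the-envelope sketch. It assumes a perfectly balanced tree and a ray that follows a single path from root to leaf, which real scenes never quite give you, and the function name and leaf count are hypothetical.

```python
def traversal_levels(leaf_count: int, branching: int) -> int:
    """Levels a perfectly balanced tree needs to hold leaf_count leaves."""
    levels, capacity = 0, 1
    while capacity < leaf_count:
        capacity *= branching
        levels += 1
    return levels


leaves = 1_000_000  # hypothetical scene size
for width in (4, 8):
    levels = traversal_levels(leaves, width)
    print(f"BVH{width}: {levels} dependent steps, {levels * width} box tests on one path")
```

For one million leaves this gives 10 dependent steps for BVH4 and 7 for BVH8. The wider tree actually runs more box tests in total (56 versus 40), but each step's tests are independent of one another, so hardware that can do 8 box tests per cycle gets through fewer, fatter steps. That is the throughput-over-latency trade described above.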

 

Do note that speeding up the BVH process is significant, but it is not all there is to ray tracing. With that said this change alone may be a 15-50% increase in some ray traced or path traced workloads, of course depending on the engine and scene!

The next improvement is with the new Oriented Bounding Boxes. The BVH structures usually test game worlds while being aligned with a game’s 3D axes – the X, Y, and Z directions so to say. This is fine some of the time, but unfortunately there are cases where a game’s geometry is not aligned in such a simple manner – which means that ray intersection tests are less efficient in these cases. The new Oriented Bounding Box (OBB) is a way to test in a manner more aligned with the geometry itself – saving some processing power and memory, and even helping throughput.
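
A quick 2D sketch shows why axis-aligned boxes get wasteful. Take a long, thin plank (the 10 by 1 size is made up for illustration) and rotate it: the oriented box always has the plank's own area, while the axis-aligned box has to grow to cover the tilted shape.

```python
import math


def aabb_area_of_rotated_rect(length: float, width: float, angle_deg: float) -> float:
    """Area of the axis-aligned box that encloses a rectangle rotated by angle_deg."""
    a = math.radians(angle_deg)
    c, s = abs(math.cos(a)), abs(math.sin(a))
    return (length * c + width * s) * (length * s + width * c)


plank_length, plank_width = 10.0, 1.0  # hypothetical plank
obb_area = plank_length * plank_width  # an oriented box hugs the plank at any angle

for angle in (0, 15, 45):
    aabb_area = aabb_area_of_rotated_rect(plank_length, plank_width, angle)
    print(f"{angle:>2} degrees: AABB is {aabb_area / obb_area:.2f}x the size of the OBB")
```

At 45 degrees the axis-aligned box is roughly six times larger than it needs to be, and every ray that crosses that empty space triggers further tests that find nothing.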

A major improvement is related to Primitive Node Compression – this is a way to minimize the BVH’s VRAM consumption and bandwidth requirements. This is extremely important since it helps make better use of a modern GPU’s cache system.

Lastly there are new instructions in the architecture in general allowing for faster or more efficient processing in certain steps. Real-time ray tracing is an immense ask of any and all modern GPUs, so every little bit counts.

 

One of the coolest advancements is the new Out of Order memory access feature. One of the “problems” AMD faced with previous architectures was that there was a strict ordering on the data return once a request was made in the GPU proper. This is not a problem all of the time, but there were cases where the data for a request was ready and could be processed much sooner even if it was made later. This is what the out of order memory access feature is trying to solve.
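
A tiny simulation makes the difference visible. The cycle counts are invented: imagine the first request misses the caches and takes 500 cycles, while three later requests hit and are ready after roughly 100. This is only a model of the ordering rule, not of the actual hardware.

```python
def in_order_return_times(ready_times):
    """With strict ordering, a result cannot return before every earlier request has."""
    returned, latest = [], 0
    for ready in ready_times:
        latest = max(latest, ready)
        returned.append(latest)
    return returned


def out_of_order_return_times(ready_times):
    """Without the ordering rule, each result returns as soon as it is ready."""
    return list(ready_times)


ready = [500, 100, 110, 120]  # hypothetical cycle at which each request's data is ready

in_order = in_order_return_times(ready)
out_of_order = out_of_order_return_times(ready)
print("in order:    ", in_order, "average", sum(in_order) / len(in_order))
print("out of order:", out_of_order, "average", sum(out_of_order) / len(out_of_order))
```

In this made-up case the three fast requests are stuck behind the slow one when returns are ordered, so everything waits 500 cycles. Without the ordering rule the average wait drops to 207.5 cycles.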

Another important improvement is the new dynamic allocation capability of RDNA 4’s registers. This change allows the shaders to request more registers than before and gives the GPU extra opportunity to do parallel work. It helps noticeably in rasterization, but its biggest benefit is to ray tracing.
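
Why registers limit parallel work can be shown with a simple occupancy calculation. All numbers here are hypothetical (a register file with 1536 registers per lane, a cap of 16 waves, a shader that needs 256 registers at its peak but only 96 most of the time), and the model ignores everything else that limits occupancy.

```python
def resident_waves(register_file_size: int, registers_per_wave: int, max_waves: int = 16) -> int:
    """How many waves fit at once if each one reserves registers_per_wave registers."""
    return min(max_waves, register_file_size // registers_per_wave)


REGISTER_FILE = 1536  # hypothetical registers per lane
PEAK_NEED = 256       # hypothetical worst-case need of a heavy shader
BASELINE_NEED = 96    # hypothetical need outside the heavy section

static_waves = resident_waves(REGISTER_FILE, PEAK_NEED)       # reserve the worst case up front
dynamic_waves = resident_waves(REGISTER_FILE, BASELINE_NEED)  # reserve little, request more later

print(f"static allocation:  {static_waves} waves in flight")
print(f"dynamic allocation: up to {dynamic_waves} waves in flight")
```

With static allocation every wave pays for its worst case all the time. If waves can start small and ask for more only when they reach the heavy section, more of them fit at once, as long as they do not all hit their peak together.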

What do all of these new technologies mean for performance? Well, if we were to equalize the clocks and bandwidth between, let’s say, an RX 9060 XT and an RX 7600 XT, the RDNA 4 card is some 8-27 percent faster in rasterization. Not bad, to be honest. The older RDNA 3 part is not quite on par with other RDNA 3 parts, but do remember that RDNA 4 also clocks higher. When we talk ray tracing, though, things are even better: a jump of 30-85% in performance, and it tends to be higher in more demanding workloads!

Path tracing is an extreme scenario and in it the newer AMD parts post 70-200% gains relative to their predecessors. More is needed and I suspect software maturity will come into play here, but it is still a big step forwards for Radeon!

Very good… no wonder FSR 4 works so well!

What about AI and Machine Learning? Look, I know it gets old fast hearing about this, but it is a part of a GPU’s capability so we need to cover it. It is also important to FSR 4! RDNA 4 has massive gains in its AI/ML capabilities. The FP16/BF16 throughput has been doubled and the INT8 and INT4 throughput has quadrupled. Add to this the new FP8 and BF8 capabilities and RDNA 4 is quite the performant beast here. This is where its jump over the capable RDNA 3 architecture is at its pinnacle. Do note that software and drivers also matter for these workloads, but the base level we are starting from is immense this time around.
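
If you wonder why anyone would want 8-bit math, here is a minimal sketch of symmetric INT8 quantization in plain Python. It is a generic textbook scheme, not what FSR 4 or AMD's drivers do, and the function names and sample values are made up.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with one shared scale factor."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale


def int8_dot(a, b):
    """Dot product done on the quantized integers, rescaled to floats at the end."""
    qa, scale_a = quantize_int8(a)
    qb, scale_b = quantize_int8(b)
    return sum(x * y for x, y in zip(qa, qb)) * scale_a * scale_b


weights = [0.5, -1.0, 0.25, 0.75]  # hypothetical values
inputs = [1.0, 2.0, -4.0, 0.5]

exact = sum(w * x for w, x in zip(weights, inputs))
approx = int8_dot(weights, inputs)
print(f"float result: {exact:.4f}, INT8 result: {approx:.4f}")
```

The INT8 answer lands very close to the float one while each value takes a quarter of the storage of FP32. Neural networks tolerate that small error well, which is why higher throughput at low precision matters so much for AI workloads.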

 

Big gains but consumers should not be too worried about this still!

There is now also support for the PCIe Gen 5 interface, doubling the bandwidth of Gen 4. This honestly does not matter much right now but it is a good thing for the future of PC gaming. Still, if you have a good motherboard or CPU with Gen 4 support, or even a full 16 lanes of Gen 3 support, it will be fine for all current RDNA 4 GPUs.
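
For reference, the per-generation numbers can be worked out from the transfer rate and the 128b/130b line encoding that PCIe has used since Gen 3. This is a rough sketch that ignores packet and protocol overhead, so real-world throughput is somewhat lower.

```python
def pcie_bandwidth_gbs(transfer_rate_gts: float, lanes: int) -> float:
    """One-direction bandwidth in GB/s for a 128b/130b-encoded link (Gen 3 and newer)."""
    return transfer_rate_gts * lanes * (128 / 130) / 8


RATES_GTS = {"Gen 3": 8, "Gen 4": 16, "Gen 5": 32}  # transfer rate per lane

for generation, rate in RATES_GTS.items():
    print(f"{generation} x16: {pcie_bandwidth_gbs(rate, 16):.1f} GB/s per direction")
```

That works out to roughly 15.8, 31.5, and 63.0 GB/s for a full x16 slot.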

And last but not least – there are improvements to the display engine. Some allow for a better streaming and recording experience, with gains in HEVC, AV1, and H.264 alongside other optimizations. The addition of the Hardware Flip Queue feature allows for some CPU power savings, since video frame scheduling is now handled on the GPU side. Alas, there is nothing significant on the DisplayPort or HDMI side, but UHBR 13.5 is competitive for now and can easily fuel even the most demanding 4K/240 HDR monitors pretty well.
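
A rough estimate shows where UHBR 13.5 stands for such a monitor. The sketch assumes 10-bit RGB (30 bits per pixel), counts only the visible pixels (no blanking), and approximates the link payload as four lanes with 128b/132b encoding, so treat the figures as ballpark only.

```python
def active_pixel_rate_gbps(width: int, height: int, refresh_hz: int, bits_per_pixel: int) -> float:
    """Uncompressed data rate of the visible pixels only, in Gbps."""
    return width * height * refresh_hz * bits_per_pixel / 1e9


def uhbr_payload_gbps(lane_rate_gbps: float, lanes: int = 4) -> float:
    """Approximate link payload after 128b/132b encoding, in Gbps."""
    return lane_rate_gbps * lanes * 128 / 132


needed = active_pixel_rate_gbps(3840, 2160, 240, 30)  # 4K, 240 Hz, 10-bit RGB
available = uhbr_payload_gbps(13.5)

print(f"4K/240 10-bit uncompressed: {needed:.1f} Gbps")
print(f"UHBR 13.5 payload:          {available:.1f} Gbps")
```

By this estimate the uncompressed signal needs a bit more than the link carries, so a 4K/240 HDR monitor on UHBR 13.5 would lean on Display Stream Compression (DSC) to close the gap.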

One hell of a good step!

 

Some gorgeous systems can be built with the RX 9070 Pure!

RDNA 4 brings some much needed and very notable improvements to Radeon’s technology. With massive gains in AI and Machine Learning, notable improvements to the display engine and streaming, big gains in traditional rasterization, and even bigger gains in ray tracing – this is one hell of an engineering effort from AMD!

With that said, make no mistake – it is just another step forward. What I see from AMD right now excites me for the future of Radeon graphics!

I wish they’d release the whitepaper though…

Many thanks to SAPPHIRE for providing me with the excellent NITRO+ RX 9070 XT. Also, thanks to Nemez, the excellent articles from Chips and Cheese and the nerdy knowledge and guidance from Sebastian Castellanos!

The article’s content, opinions, beliefs and viewpoints expressed in SAPPHIRE NATION are the authors’ own and do not necessarily represent official policy or position of SAPPHIRE Technology.

Alexander Yordanov
My name is Alexander and I am an enthusiastic PC Gamer from Sofia, Bulgaria. Video games have been my go-to hobby for as long as I can remember. I started with good old DOOM and Warcraft 1 and also had a Terminator console. In time my often outdated hardware has made me read up Tech Guides and try to understand what goes within a game as well as how to appreciate it or understand it better.
