The Polaris Architecture: In Brief

For today’s preview I’m going to quickly hit the highlights of the Polaris architecture.

In their announcement of the architecture this year, AMD laid out a basic overview of what components of the GPU would see major updates with Polaris. Polaris is not a complete overhaul of past AMD designs, but AMD has combined targeted performance upgrades with a chip-wide energy efficiency upgrade. As a result Polaris is a mix of old and new, and a lot more efficient in the process.

At its heart, Polaris is based on AMD’s 4th generation Graphics Core Next architecture (GCN 4). GCN 4 is not significantly different from GCN 1.2 (Tonga/Fiji), and in fact GCN 4’s ISA is identical to that of GCN 1.2. So everything we see here today comes not from broad architectural changes, but from low-level microarchitectural changes that improve how instructions execute under the hood.

Overall AMD is claiming that GCN 4 (via RX 480) offers a 15% improvement in shader efficiency over GCN 1.1 (R9 290). This comes from two changes: instruction prefetching and a larger instruction buffer. In the case of the former, GCN 4 can, with the driver’s assistance, attempt to pre-fetch future instructions, something GCN 1.x could not do. When done correctly, this reduces or eliminates the need for a wave to stall while waiting on an instruction fetch, keeping the CU fed and active more often. Meanwhile the per-wave instruction buffer (which is separate from the register file) has been increased from 12 DWORDs to 16 DWORDs, allowing more instructions to be buffered and, according to AMD, improving single-threaded performance.
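
To put the buffer change in concrete terms, here is a quick back-of-the-envelope sketch. It assumes a DWORD is 4 bytes and that most instructions occupy either one or two DWORDs; the helper name is made up for illustration, and none of this models how the hardware actually schedules fetches.

```python
DWORD_BYTES = 4  # a DWORD is 32 bits


def buffer_capacity(buffer_dwords, instr_dwords):
    """Whole instructions of a given size that fit in the buffer (hypothetical helper)."""
    return buffer_dwords // instr_dwords


for name, dwords in (("GCN 1.x", 12), ("GCN 4", 16)):
    print(
        f"{name}: {dwords * DWORD_BYTES} bytes, "
        f"{buffer_capacity(dwords, 2)} to {buffer_capacity(dwords, 1)} instructions"
    )

# Relative growth of the buffer, independent of instruction size
growth = 16 / 12 - 1
print(f"Buffer growth: {growth:.0%}")
```

In other words, the buffer is about a third larger, which under these assumptions means room for two to four more instructions per wave.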

Outside of the shader cores themselves, AMD has also made enhancements to the graphics front-end for Polaris. AMD’s latest architecture integrates what AMD calls a Primitive Discard Accelerator. True to its name, the job of the discard accelerator is to remove (cull) triangles that are too small to be used, and to do so early enough in the rendering pipeline that the rest of the GPU is spared from having to deal with these unnecessary triangles. Degenerate triangles are culled before they even hit the vertex shader, while small triangles are culled a bit later, after the vertex shader but before they hit the rasterizer. There’s no visual quality impact to this (only triangles that can’t be seen/rendered are culled), and as claimed by AMD, the benefits of the discard accelerator increase with MSAA levels, as MSAA otherwise exacerbates the small triangle problem.
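
As a rough illustration of the two kinds of culling described above, the sketch below shows one common way such tests can be written in software. This is not AMD's implementation: the function names are hypothetical, the sample grid is assumed to be one sample at each pixel center, and the small-triangle test is a conservative bounding-box check.

```python
import math


def is_degenerate(i0, i1, i2):
    """Index-only test: a triangle that repeats a vertex index has zero area,
    so it can be rejected before any vertex is shaded."""
    return i0 == i1 or i1 == i2 or i0 == i2


def misses_all_pixel_centers(tri):
    """Conservative screen-space test on three (x, y) positions in pixels.

    Assumes one sample per pixel, located at (i + 0.5, j + 0.5). If the
    triangle's bounding box contains no sample center along either axis,
    the triangle cannot produce any pixels and is safe to cull.
    """
    for axis in (0, 1):
        lo = min(v[axis] for v in tri)
        hi = max(v[axis] for v in tri)
        first_center = math.ceil(lo - 0.5)   # index of first center >= lo
        last_center = math.floor(hi - 0.5)   # index of last center <= hi
        if first_center > last_center:
            return True
    return False


tiny = [(10.1, 10.1), (10.3, 10.1), (10.2, 10.3)]   # sits between pixel centers
large = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]

print(is_degenerate(4, 4, 7))            # repeated index
print(misses_all_pixel_centers(tiny))
print(misses_all_pixel_centers(large))
```

Because the second test only looks at the bounding box, it never culls a visible triangle; it simply lets some invisible ones through to the rasterizer.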

Along these lines, Polaris also implements a new index cache, again meant to improve geometry performance. The index cache is designed specifically to accelerate geometry instancing performance, allowing small instanced geometry to stay close by in the cache, avoiding the power and bandwidth costs of shuffling this data around to other caches and VRAM.

Finally, at the back-end of the GPU, the ROP/L2/Memory controller partitions have also received their own updates. Chief among these is that Polaris implements the next generation of AMD’s delta color compression technology, which uses pattern matching to reduce the size and resulting memory bandwidth needs of frame buffers and render targets. The result is a de facto increase in available memory bandwidth and a decrease in power consumption, at least so long as the buffer is compressible. With Polaris, AMD supports a larger pattern library to better compress more buffers more often, improving on GCN 1.2 color compression by around 17%.
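
For readers unfamiliar with the general idea behind delta compression, here is a toy sketch. It is not AMD's scheme: the real pattern library isn't described here, and the tile layout, the 4-bit delta width, and all function names below are assumptions chosen purely for illustration. The point is only that a tile of similar values can be stored as one anchor value plus small differences, and that a tile which doesn't fit the pattern falls back to uncompressed storage.

```python
def compress_tile(pixels, delta_bits=4):
    """Try to encode a tile of 8-bit values as an anchor plus small signed deltas.

    Returns (anchor, deltas) on success, or None if any delta is too large,
    in which case the tile would be stored uncompressed.
    """
    anchor = pixels[0]
    lo = -(1 << (delta_bits - 1))
    hi = (1 << (delta_bits - 1)) - 1
    deltas = [p - anchor for p in pixels[1:]]
    if all(lo <= d <= hi for d in deltas):
        return anchor, deltas
    return None


def decompress_tile(anchor, deltas):
    """Losslessly rebuild the original tile."""
    return [anchor] + [anchor + d for d in deltas]


def stored_bits(pixels, delta_bits=4, pixel_bits=8):
    """Bits needed to store the tile, compressed if possible."""
    if compress_tile(pixels, delta_bits) is None:
        return len(pixels) * pixel_bits
    return pixel_bits + (len(pixels) - 1) * delta_bits


smooth = [100, 101, 102, 103, 101, 100, 99, 98]   # gentle gradient
noisy = [0, 255, 3, 200, 17, 90, 250, 1]          # no usable pattern

print(stored_bits(smooth), "bits vs", len(smooth) * 8)
print(stored_bits(noisy), "bits vs", len(noisy) * 8)
```

The smooth tile shrinks while the noisy one does not, which is why the bandwidth savings only apply so long as the buffer is compressible.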

Otherwise we’ve already covered the increased L2 cache size, which is now at 2MB. Paired with this is AMD’s latest generation memory controller, which can now officially go to 8Gbps, and even a bit more than that when overclocking.
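
To translate that per-pin data rate into total memory bandwidth, the usual arithmetic is data rate times bus width, divided by 8 bits per byte. The sketch below assumes the RX 480's 256-bit memory bus, a figure that isn't stated in this section; the function name is likewise just for illustration.

```python
def peak_bandwidth_gb_s(data_rate_gbps, bus_width_bits):
    """Theoretical peak bandwidth in GB/s: per-pin rate x pins / 8 bits per byte."""
    return data_rate_gbps * bus_width_bits / 8


# Assumption: 256-bit bus, running at the 8Gbps rate mentioned above
print(peak_bandwidth_gb_s(8, 256), "GB/s")
```

This is a theoretical peak only; what a game actually sees depends on access patterns and on how much of the traffic the color compression can eliminate.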


449 Comments

  • xthetenth - Wednesday, June 29, 2016 - link

    It's a small chip and cheap card, so the design is delivering on the goal of making that level of performance cheap.
  • fanofanand - Wednesday, June 29, 2016 - link

    The problem as I see it, is that it's barely cheaper than a 970 that performs similarly. I get the whole 3.5 Gb issue with the 970, but based on those charts they are neck and neck with the 970 often beating it. Maybe my expectations were out of whack, but I had really hoped that AMD would be offering 970 performance for 960/950 pricing, given the updated node.
  • Drumsticks - Wednesday, June 29, 2016 - link

    I'm not expecting an /upgrade/ from the 390, but any insight into why the 480 barely beats the 390 despite 10% more shaders? Where are all of the uarch changes going to? Is it a lack of ROPs? That's about the only thing I can think of. Performance at 1440p seems fairly eh.
  • watzupken - Wednesday, June 29, 2016 - link

    I think its ROP starved. Which is typically the case for cards in the mid range.
  • zoxo - Wednesday, June 29, 2016 - link

    32 ROPs are very much more at home at 1080p. U have some leeway with frequency, but 1440p is a big jump in pixel count.
  • D. Lister - Wednesday, June 29, 2016 - link

    Yes, 32 ROPs vs 64 ROPs of the 390. It really only starts showing at >1080p resolutions though.
  • extide - Wednesday, June 29, 2016 - link

    480 has LESS shaders than 390.
    480 - 2304
    390 - 2560
    390X - 2816
  • smackosaurus - Wednesday, June 29, 2016 - link

    So no 980 in the chart?
    Would be interesting to see a comparison in DX12 with the 480 and 980, but somehow after week of people saying the 480 was near 980 levels in DX 12...the 980 was somehow left out.
    Great job.
  • Ryan Smith - Wednesday, June 29, 2016 - link

    You can find that data (and more) in Bench: http://www.anandtech.com/bench/product/1748?vs=171...

    Otherwise 980 isn't in these charts as it's not really a meaningful comparison. Retail sales have already started winding down, and in terms of performance the RX 480 averages just 3% ahead of GTX 970. It's not a 980-level card.
  • warreo - Wednesday, June 29, 2016 - link

    I have to disagree with you Ryan. People (unfairly or not, just read the all the comments) expected this to land somewhere between 970/980, so I really am not sure how you can say it's "not really a meaningful comparison." To me, it's good to know that it's basically equivalent to a 970, but also useful to know that it's on average 15-20% slower than a 980 by extension.

    Again, I could (and did) look it up on the Bench, but it would have been useful to have in the charts off the bat.
