The Turing Architecture: Volta in Spirit

Diving straight into the microarchitecture, the new Turing SM looks very different from the Pascal SM, but those who have been keeping track of Volta will notice a lot of similarities to NVIDIA's more recent microarchitecture. In fact, at a high level the Turing SM is fundamentally the same, with the notable exception of a new IP block: the RT Core. Putting the RT Cores and Tensor Cores aside for now, the most drastic changes from Pascal are the same ones that differentiated Volta from Pascal. Turing's advanced shading features fall into the same bucket, in needing explicit developer support.

Like Volta, the Turing SM is partitioned into 4 sub-cores (or processing blocks), with each sub-core having a single warp scheduler and dispatch unit, as opposed to Pascal's 2-partition setup with two dispatch ports per sub-core warp scheduler. There are some fairly major implications with this change: broadly speaking, Volta/Turing loses the capability to issue a second, non-dependent instruction from a thread in a single clock cycle. Turing is presumably identical to Volta in executing instructions over two cycles, but with schedulers that can issue an independent instruction every cycle, so ultimately Turing can maintain 2-way instruction level parallelism (ILP) this way, while still having twice the number of schedulers over Pascal.
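
To make the ILP point concrete, here is a minimal CUDA sketch of our own (not NVIDIA's code): a kernel whose inner loop carries two independent dependency chains always has a non-dependent instruction available, which is exactly what a single-dispatch-per-cycle Volta/Turing scheduler needs to keep issuing every clock while each math instruction completes over two cycles.

```cuda
// Hypothetical illustration: two independent FMA chains. Within a warp, the
// scheduler can alternate issuing from chain A and chain B, since neither
// instruction depends on the other's most recent result.
__global__ void ilp2_fma(const float* __restrict__ x, float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = x[i];
    float a = 0.0f;   // dependency chain A
    float b = 0.0f;   // dependency chain B, independent of A

    #pragma unroll
    for (int k = 0; k < 8; ++k) {
        a = fmaf(v, 1.001f, a);   // depends only on the previous a
        b = fmaf(v, 0.999f, b);   // depends only on the previous b
    }
    out[i] = a + b;
}
```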

As we saw in Volta, these changes go hand-in-hand with the new scheduling/execution model featuring independent thread scheduling, which Turing also has, though any differences between the two were not disclosed at this time. Rather than per-warp resources as in Pascal, Volta and Turing have per-thread scheduling resources, with a program counter and stack per thread to track thread state, as well as a convergence optimizer to intelligently group active same-warp threads together into SIMT units. So all threads are equally concurrent, regardless of warp, and can yield and reconverge.
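
As a rough idea of what independent thread scheduling looks like from the software side (a hypothetical sketch using the standard CUDA 9+ warp primitives, not anything Turing-specific disclosed by NVIDIA), diverged threads in a warp can make forward progress independently, and the points where the programmer wants warp-wide reconvergence or communication are now spelled out explicitly:

```cuda
// Hypothetical example; assumes n and the block size are multiples of 32
// so every warp is fully active.
__global__ void divergent_paths(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int v = data[i];
    if (v & 1)
        v = v * 3 + 1;          // odd values take this branch
    else
        v = v >> 1;             // even values take the other branch

    __syncwarp();               // explicit reconvergence point for the warp

    // Warp-level reduction using the explicit-mask shuffle variant.
    for (int off = 16; off > 0; off >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, off);

    if ((threadIdx.x & 31) == 0)
        data[i] = v;            // lane 0 of each warp writes the warp's sum
}
```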

In terms of the CUDA cores and ALUs, the Turing sub-core has 16 INT32 cores, 16 FP32 cores, and 2 Tensor Cores, the same setup as the Volta sub-core. With a split INT/FP datapath model like Volta's, Turing can also concurrently execute FP and INT instructions, which, as we will see, is much more relevant with the RT cores involved. Where Turing differs is in lacking Volta's full complement of FP64 cores, instead having a token amount (2 per SM) for compatibility reasons, resulting in FP64 throughput being 1/32 the TFLOP rate of FP32. Maimed FP64 is standard for NVIDIA's consumer GPUs, but what has not been standard until now is Turing's full 2x FP16 throughput, which was available in GP100 but was crippled in the other Pascal GPUs.
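
For reference, the fast FP16 path is exposed through the packed half2 types and intrinsics in cuda_fp16.h; a minimal sketch of our own (assuming a GP100/Volta/Turing-class device) shows where the 2x rate comes from, since each instruction operates on two FP16 values at once:

```cuda
#include <cuda_fp16.h>

// Packed-FP16 AXPY (hypothetical example): each __half2 holds two FP16 values,
// and each __hfma2 is a single fused multiply-add over both of them.
__global__ void axpy_half2(const __half2* __restrict__ x,
                           __half2* __restrict__ y,
                           __half2 alpha, int n2)       // n2 = number of FP16 pairs
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)
        y[i] = __hfma2(alpha, x[i], y[i]);
}
```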

While these details may be more on the technical side of things, in Volta this design seemed inextricably linked to extracting the maximum performance from the tensor cores while minimizing disruption to the parallelism or coordination of other compute workloads. The same is most likely true of Turing's 2nd generation tensor cores and its RT cores, where 4 independently scheduled sub-cores and granular thread manipulation would be very useful in extracting the most performance out of mixed gaming-oriented workloads, where rendering a single frame pulls multiple blocks of the GPU into working in conjunction. This is actually the concept that underlies the RTX-OPS metric, and we will revisit it in depth later.

Memory-wise, every sub-core now has an L0 instruction cache like Volta, paired with an identically sized 64 KB register file. In Volta, this was important in reducing latency when the tensor cores were in play, and in Turing this likely benefits the RT cores similarly, which we will discuss in a later section. Otherwise, the Turing SM also has 4 load/store units per sub-core, down from 8 in Volta, but still maintains 4 texture units.

Further up the memory hierarchy is the new L1 data cache and Shared Memory (SMEM), which have been revamped and unified into a single partitionable memory block, another Volta innovation. For Turing, this looks to be a combined 96 KB of L1/SMEM, which traditional graphics workloads divide as 64 KB of dedicated graphics shader RAM and 32 KB for texture cache and register file spill area. Meanwhile, compute workloads can partition the L1/SMEM with up to 64 KB as L1 and the remaining 32 KB as SMEM, or vice versa. For Volta, SMEM can be configured up to 96 KB.
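
On the compute side the split is expressed as a per-kernel preference rather than a fixed allocation; a minimal host-side sketch (our own example with a hypothetical kernel name, using the carveout attribute introduced with CUDA 9) would look something like this:

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel() { /* ... shared-memory-heavy compute kernel ... */ }

void prefer_shared_memory()
{
    // Request roughly two thirds of the combined L1/SMEM block as shared memory
    // (about 64 KB SMEM / 32 KB L1 on Turing); the driver rounds the hint to a
    // split the hardware actually supports.
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         66);
}
```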

Though many of these details are only of value to developers, there are several important points to make here. One is simply how similar Turing and Volta are; after all, they are in the same generational compute family. Another is how compute-oriented Volta – and by extension, Turing – are, and the fact that this is being brought to consumers as part of NVIDIA's proclaimed 'future of gaming.' Part of that is, of course, permitting fast FP16 in potential gaming workloads, but Turing goes far beyond that. At the low level, Turing is less about maximizing traditional gaming performance, and more about maximizing gaming with special technologies such as real-time raytracing.

For their part, NVIDIA points to Turing's leap in performance over Pascal, from memory hierarchy bandwidth uplifts to 50% more shader performance per core, but unfortunately for today we can't connect this to any real-world data or performance. On concurrent FP/INT execution in gaming, the company is keen to point out that around 36 INT instructions (per 100 FP instructions) could be freed up by moving them to their own pipe, though this doesn't describe Turing performance, only the applicability of its concurrent execution feature in games.

It becomes a bit of a complex scenario, as we know that Volta already improved on Pascal in these aspects with concurrent execution, a brand new ISA, and a reworked SM. And Turing doesn't seem to involve architectural changes aimed at significant clockspeed enhancements a la Pascal over Maxwell, though of course on the process side the 12nm FFN node is a factor. So it comes down to special gaming workloads and real-world performance. The latter is not available today, but the former is so important to Turing that it merited dropping 'GTX' for 'RTX'. And of those special workloads, real-time raytracing and the RT cores take center stage.

Comments

  • Spunjji - Monday, September 17, 2018 - link

    There's no such thing as a bad product, just bad pricing. AMD aren't out of the game but they are playing in an entirely different league.
  • siberian3 - Friday, September 14, 2018 - link

    Good architectural leap for NVIDIA, but it is sad that very few gamers can afford the new cards.
    And AMD is not doing anything for 2018, and Navi will probably be mid-range on 7nm.
  • V900 - Friday, September 14, 2018 - link

    Meh, it’s always been that way with the newest, fastest GPUs.

    Wait 6 months to a year, and prices will be where people with more modest budgets can play along.
  • B3an - Friday, September 14, 2018 - link

    You must literally live under a rock while also being absurdly naive.

    It's never been this way in the 20 years that I've been following GPUs. These new RTX GPUs are ridiculously expensive, way more than ever, and the prices will not be changing much at all when there's literally zero competition. The GPU space right now is worse than it's ever been before in history.
  • Amandtec - Friday, September 14, 2018 - link

    I read somewhere that
    8800GTX + inflation = 2080ti price
    Without factoring in inflation, the prices seem unprecedented.
  • Yojimbo - Saturday, September 15, 2018 - link

    And you must factor in inflation; otherwise you are just pushing numbers around.
  • Yojimbo - Saturday, September 15, 2018 - link

    And comparing the 2080 Ti to previous flagship launch cards is not really proper. The 2080 Ti is a different tier of card. The die is so much larger than that of any previous launch GPU. It's just a demonstration of the increase in the amount of resources people are willing to devote to their GPUs, not an indication of an inflation of GPU prices.
  • eddman - Saturday, September 15, 2018 - link

    $600 in 2006 = $750 in 2018 dollars
  • Samus - Saturday, September 15, 2018 - link

    What inflation, exactly, are you talking about? The dollar hasn't had a substantial change in valuation for 20 years (compared to other first-world currencies).

    The USD inflation rate has averaged around 2.7%/year since 2000. That means one dollar in 2000 is worth slightly less than $1.50 today. The top-of-the-line GPUs released in 2000 (I'd guess the GeForce2 GTS and/or the 3dfx Voodoo5 5500) both cost $300.

    For those who want to throw in cards like the GeForce2 Ultra and the Voodoo5 6000 (the former a card for nVidia to 'probe' the market for how much they could milk it going forward, creating the situation we have today, and the latter a card that never actually "launched"), we can include them for fun. The Ultra launched at $500 (even though it was slower than the GeForce3 that launched 3 months later), and the Voodoo5 6000 had an MSRP set by 3dfx at $500.

    These were the most expensive gaming-focused GPUs ever made up until that date. Even SLI setups didn't cost $500 (the most expensive Voodoo2 card in the '90s was from Creative Labs at $229 each; you needed two cards, of course, so $460).

    Ok, so you have the absolute cream-of-the-crop cards in 2000 at $500, one was a marketing stunt, and the other never launched because nobody would have bought it. Realistically the most expensive cards were $300. But we will go with $500.

    The most expensive high-end gaming-focused cards now are $1000+.

    That would assume an inflation rate of over 5% annually, or the value of the dollar DOUBLING over 2 decades, which it didn't come close to doing.

    Stop using inflation as an excuse. It's bullshit. These companies are fucking greedy. Especially nVidia. They are effectively charging FOUR TIMES more than they used to for the same market segment card. 20 years ago you would have bought a TNT2 Ultra for $230 and had the ultimate card available. Most people purchased entirely capable mainstream cards for $100-$150, like the TNT2 Pro or the GeForce2 MX400, that ran the most demanding games of the day like Counter-Strike and Half-Life at 1024x768 in maximum detail.

    http://www.in2013dollars.com/2000-dollars-in-2018?...
  • Yojimbo - Saturday, September 15, 2018 - link

    "What inflation, exactly are you talking about."

    CPI. Consumer Price Index. Even though inflation has been low for quite a while, $649 in 2013 is $697 today. That's almost $50 more, and it's enough to make up the difference between the 2013 launch price of the GTX 780 and the 2018 launch price of the RTX 2080.

    I'm not sure why you are talking about cards from 20+ years ago. It's not relevant to my reply. In any case, those cards were completely different. The die sizes were much smaller and the cards were much less capable. They did a lot less of the work, as much of it was done on the CPU. The CPU was much more important to the game performance than today, as was the RAM and other components that were worth spending money on to significantly improve the gaming performance/experience.

    "Stop using inflation as an excuse."

    I'm not using inflation as an excuse. I'm using inflation as a tool to accurately compare the prices of cards from different years. And doing so clearly shows that the claim that the OP made is wrong. My reply had nothing to do with whether cards were in general cheaper 20 years ago or not. It was in response to "These new RTX GPUs are ridiculously expensive, way more than ever". That's provably untrue. Why are you replying to me and arguing about some entirely different point I wasn't ever talking about?
