CPU Performance

For simplicity, we are listing the percentage performance differentials in all of our CPU testing – the number shown is the % performance of having SMT2 enabled compared to having the setting disabled. Our benchmark suite consists of over 120 tests, full details of which can be found in our #CPUOverload article.
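As a rough illustration of how such a percentage can be derived, here is a minimal sketch. The function name, the `higher_is_better` flag, and the raw numbers are hypothetical and are not taken from our test harness; the only assumption is that each test yields one score (or one run time) per SMT setting.

```python
def relative_performance(smt_on: float, smt_off: float,
                         higher_is_better: bool = True) -> float:
    """Return SMT-on performance as a percentage of the SMT-off baseline.

    For score-type results (higher is better) the ratio is on/off.
    For time-type results (lower is better) the ratio is inverted,
    so that a shorter run time still shows up as more than 100%.
    """
    if smt_on <= 0 or smt_off <= 0:
        raise ValueError("results must be positive")
    ratio = smt_on / smt_off if higher_is_better else smt_off / smt_on
    return 100.0 * ratio


# Hypothetical raw results, for illustration only.
print(f"{relative_performance(1250.0, 1000.0):.1f}%")                      # a score
print(f"{relative_performance(80.0, 100.0, higher_is_better=False):.1f}%")  # seconds
```

With this convention the SMT Off column is always 100%, and anything above 100% in the SMT On column is a gain from enabling SMT.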

Here are the single threaded results.

Single Threaded Tests: AMD Ryzen 9 5950X

AnandTech          SMT Off (Baseline)    SMT On
y-Cruncher         100%                  99.5%
Dwarf Fortress     100%                  99.9%
Dolphin 5.0        100%                  99.1%
CineBench R20      100%                  99.7%
Web Tests          100%                  99.1%
GeekBench (4+5)    100%                  100.8%
SPEC2006           100%                  101.2%
SPEC2017           100%                  99.2%

Interestingly enough, our single threaded performance was within roughly a single percentage point across the stack (SPEC2006 being the largest swing at +1.2%). Given that running with SMT disabled should arguably give more resources to each thread, the fact that we see no meaningful difference means that AMD’s implementation of giving a single thread access to all the resources even in SMT mode is quite good.

The multithreaded tests are a bit more diverse:

Multi-Threaded Tests: AMD Ryzen 9 5950X

AnandTech               SMT Off (Baseline)    SMT On
Agisoft Photoscan       100%                  98.2%
3D Particle Movement    100%                  165.7%
3DPM with AVX2          100%                  177.5%
y-Cruncher              100%                  94.5%
NAMD AVX2               100%                  106.6%
AIBench                 100%                  88.2%
Blender                 100%                  125.1%
Corona                  100%                  145.5%
POV-Ray                 100%                  115.4%
V-Ray                   100%                  126.0%
CineBench R20           100%                  118.6%
HandBrake 4K HEVC       100%                  107.9%
7-Zip Combined          100%                  133.9%
AES Crypto              100%                  104.9%
WinRAR                  100%                  111.9%
GeekBench (4+5)         100%                  109.3%

Here we have a number of different factors affecting the results.

We start with the two tests that scored statistically worse with SMT2 enabled: y-Cruncher and AIBench. Both tests are memory-bound and compute-bound in parts, where the memory bandwidth per thread can become a limiting factor in overall run-time. y-Cruncher is arguably a math synthetic benchmark, and AIBench is still a set of early-beta AI workloads for Windows, so both are quite far away from real world use cases.

Most of the remaining benchmarks gain between +5% and +35%, which includes a number of our rendering tests, molecular dynamics, video encoding, compression, and cryptography. This is where we can see both threads on each core interleaving inside the buffers and execution units, which is the goal of an SMT design. There are still some bottlenecks in the system that keep both threads from getting absolute full access, which could be buffer size, retire rate, op-queue limitations, memory limitations, and so on; each benchmark is likely different.

The outliers are 3DPM, 3DPM with AVX2, and Corona. All three gain 45% or more, with both 3DPM variants above 65%. These tests are very light on cache and memory requirements, and put the increased Zen3 execution port distribution to good use. These benchmarks are compute heavy as well, so splitting some of that memory access and compute in the core helps SMT2 designs mix those operations to greater effect. The fact that 3DPM in AVX2 mode gets a higher benefit might be down to coalescing operations for an AVX2 load/store implementation: there is less waiting to pull data from the caches, and less contention, which adds some extra performance.
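The grouping described above can be reproduced from the multi-threaded table with a short sketch. The 45% outlier threshold comes from the text; the ±2% noise band is an assumption made here for illustration only, chosen so that a result such as Agisoft Photoscan at 98.2% is not counted as a regression.

```python
# SMT-on results from the multi-threaded table, as a percentage of the
# SMT-off baseline (100%).
MT_RESULTS = {
    "Agisoft Photoscan": 98.2,
    "3D Particle Movement": 165.7,
    "3DPM with AVX2": 177.5,
    "y-Cruncher": 94.5,
    "NAMD AVX2": 106.6,
    "AIBench": 88.2,
    "Blender": 125.1,
    "Corona": 145.5,
    "POV-Ray": 115.4,
    "V-Ray": 126.0,
    "CineBench R20": 118.6,
    "HandBrake 4K HEVC": 107.9,
    "7-Zip Combined": 133.9,
    "AES Crypto": 104.9,
    "WinRAR": 111.9,
    "GeekBench (4+5)": 109.3,
}

NOISE_BAND = 2.0     # assumption: +/-2% is treated as run-to-run noise
OUTLIER_GAIN = 45.0  # the text treats gains of 45% or more as outliers


def classify(pct_of_baseline: float) -> str:
    """Place one result into a bucket based on its change versus SMT off."""
    delta = pct_of_baseline - 100.0
    if abs(delta) < NOISE_BAND:
        return "within noise"
    if delta < 0:
        return "regression"
    if delta >= OUTLIER_GAIN:
        return "outlier"
    return "moderate gain"


def bucket(results: dict) -> dict:
    """Group test names by bucket, keeping the table order."""
    buckets = {}
    for name, pct in results.items():
        buckets.setdefault(classify(pct), []).append(name)
    return buckets


BUCKETS = bucket(MT_RESULTS)
for label, names in BUCKETS.items():
    print(f"{label}: {', '.join(names)}")
```

Changing `NOISE_BAND` moves borderline results such as Photoscan or AES Crypto between buckets, so the exact split depends on how much run-to-run variation one is willing to assume.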

Overall

In an ideal world, both threads on a core will have full access to all resources, and not block each other. However, that just means that the second thread looks like it has its own core completely. The reverse SMT method, of using one global core and splitting it into virtual cores with no contention, is known as VISC, and the company behind that was purchased by Intel a few years ago, but nothing has come of it yet. For now, we have SMT, and by design it will accelerate some key workloads when enabled.

In our CPU results, the single threaded benchmarks showed no meaningful difference between SMT enabled and SMT disabled in our real-world or synthetic workloads. This means that even in SMT enabled mode, if one thread is running, it gets everything the core has on offer.

For multi-threaded tests, there is clearly a spectrum of workloads that benefit from SMT.

Those that don’t are either hyper-optimized on a one-thread-per-core basis, or memory latency sensitive.

Most real-world workloads see a modest uplift, an average of 22%. Rendering and ray tracing can vary depending on the engine, and how much bandwidth/cache/core resources each thread requires, potentially moving the execution bottleneck somewhere else in the chain. Execution-limited tests that don’t probe memory or the cache at all, which to be honest are most likely to be hyper-optimized compute workloads, scored up to +77% in our testing.
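For readers who want to summarize the multi-threaded table themselves, here is a minimal sketch that averages all sixteen rows. Which tests count as real-world for the 22% figure is not listed, so this is an assumption-laden cross-check over the whole table rather than a reproduction of that number, and it is not expected to match it exactly. The geometric mean is included because it is a common choice for averaging ratios.

```python
from statistics import geometric_mean, mean

# SMT-on results from the multi-threaded table, in table order,
# as a percentage of the SMT-off baseline (100%).
MT_PERCENT = [
    98.2, 165.7, 177.5, 94.5, 106.6, 88.2, 125.1, 145.5,
    115.4, 126.0, 118.6, 107.9, 133.9, 104.9, 111.9, 109.3,
]

arith = mean(MT_PERCENT)           # simple average of the percentages
geo = geometric_mean(MT_PERCENT)   # less sensitive to the large 3DPM results

print(f"arithmetic mean uplift: {arith - 100:+.1f}%")
print(f"geometric mean uplift:  {geo - 100:+.1f}%")
```

Dropping the synthetic tests or the regressions from `MT_PERCENT` shifts the result by several points, which is why any single average should be read alongside the per-test numbers.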


126 Comments


  • Machinus - Thursday, December 3, 2020

    Can you sell me yours so I can try one?
  • Marwin - Thursday, December 3, 2020

    For me the main question is not whether SMT is bad or good in multithread, but how good or bad it is for 2-4-6 thread loads on, for example, a 12 core Ryzen, when Windows may or may not schedule threads to real cores (1 thread per core) or to SMT cores in series.
  • Duraz0rz - Thursday, December 3, 2020

    IIRC, Windows knows what cores are real vs virtual and what virtual core maps to a real core. It shouldn't matter if a thread is scheduled on a real or virtual core, though. If a thread is scheduled on a virtual core that maps to a real core that's not utilized, it still has access to the resources of the full core.

    SMT doesn't come into play until you need more threads than cores.
  • GreenReaper - Thursday, December 3, 2020

    That's not *quite* true. Some elements are statically partitioned, notably some instruction/data queues. See 20.19 Simultaneous multithreading in https://www.agner.org/optimize/microarchitecture.p...
    "The queueing of µops is equally distributed between the two threads so that each thread gets half or the maximum throughput."

    This partitioning is set on boot. So, where each thread might get 128 queued micro-ops with SMT off, you only get 64 with it on. This might have little or no impact, but it depends on the code.

    The article itself says: "In the case of Zen3, only three structures are still statically partitioned: the store queue, the retire queue, and the micro-op queue. This is the same as Zen2."
  • jeisom - Thursday, December 3, 2020

    Honestly it looks like you provided a 3rd viewpoint. As these are general purpose processors it really depends on the workload/code optimization and how they are optimized for a given targeted workload.
  • jospoortvliet - Thursday, December 3, 2020

    Hmmm, if you have a *very* specific workload, yes, 'it depends', but we're really talking HPC here. Pretty much nothing you do at home makes it worth rebooting-and-disabling-SMT for on an AMD Zen 3.
  • Holliday75 - Thursday, December 3, 2020

    The confusion comes in because these are consumer processors. These are not technically HPC. Lines are being blurred as these things make $10k CPUs from 5-10 years ago look like trash in a lot of workloads.
  • GeoffreyA - Thursday, December 3, 2020

    Interesting article. Thank you. Would be nice to see the Intel side of the picture.
  • idealego - Thursday, December 3, 2020

    I imagine compiler optimization these days is tuned for SMT. Perhaps this could have been discussed in the article? I wonder how much of a difference this makes to SMT on/off.
  • bwj - Thursday, December 3, 2020

    This article ignores the important viewpoint from the server side. If I have several independent, orthogonal workloads scheduled on a computer I can greatly benefit from SMT. For example if one workload is a pointer-chasing database search type of thing, and one workload is compressing videos, they are not going to contend for backend CPU resources at all, and the benefit from SMT will approach +100%, i.e. it will work perfectly and transparently. That's the way you exploit SMT in the datacenter, by scheduling orthogonal non-interfering workloads on sibling threads.
