SMT4 Details - Four Threads Per Core

One of the things that makes the Thunder line-up stand out from the competition is its inclusion of 4-way SMT, meaning that each core can execute up to 4 threads.

 

Each thread is viewed from the OS as a fully independent CPU and each has its own independent Arm architecture state, sharing the vast majority of the core’s resources bar a very few exceptions such as the aforementioned Skid Buffer.

The microarchitecture had always been multi-threaded, but Marvell went ahead to re-account for the area impact of SMT and discloses that it only takes about 5% of a core.

The company further details some of the mechanisms of its SMT, such as its arbitration mechanism between threads. During the fetch stage for example, the core will pick the thread which currently has the least amount of instructions live in the core’s pipelines, ensuring that the number of micro-ops and instructions further down the pipeline are balanced between the threads. We see a similar logic in the dispatch stage, and the thread with the fewest instructions downstream in the pipeline is picked out of the Skid Buffer.

The back-end has no notion of threads and simply executes the micro-ops which are oldest first. Retiring happens with priority in regards to the threads that have most backed up instructions for retiring.

Marvell says that this thread arbitrations works quite well on most codes, with the execution latencies between threads being quite uniform.

The speed-up that SMT can bring to the table is reversely correlated with the IPC of a given workload, meaning that a low IPC workload will see the biggest improvements with SMT. Other way to describe this is data-plane centric workloads which have a high latency to data fetching for execution are better suited to hide these kind of bottlenecks and idle-periods of the core through SMT.

Low-IPC workloads such as databases see a quite big gain in IPC and performance reaching up to 2x for 4 threads. Higher IPC workloads with a smaller data footprint will see more limited benefit to IPC.

Translating this to socket-level performance, we see a great scaling up to 60 cores which is essentially the physical core count of the processor, and a more sub-linear, but still quite respectable scaling up to 240 threads. Performance from 60 to 240 threads increases by roughly 60% which is a nice gain considering the very low area impact of SMT4 on Marvell’s cores.

When asked about how its ThudnerX3 is positioned against the competition, Marvell says that against Intel based products the company will slightly lag behind in single-thread performance, but will offer vastly greater multi-threaded throughput. Against AMD (we assume Rome), the TX3 is said to perform better in single-threaded performance with AMD taking the lead in workloads with low data sharing, although the TX3 to do better in workloads with more data-sharing such as database applications. Graviton2 is said to be a very good chip, although it offered low frequency and no threading support, so those are the areas the TX3 would be better in.

Overall, the TX3 seems like a solid candidate in the current server space, however I don’t feel like it differentiates itself very much aside from the fact that it offers SMT support. I feel like the CPU’s microarchitecture is still quite narrow, and although the IPC improvements are generationally good, Marvell also has significantly longer time between releases than Arm. In that regard, only slightly beating the Graviton2 here doesn’t seem enough and I do expect Altra-based designs to be faster.

We’ll have to see how the ThunderX3 ends up in terms of performance and power efficiency, but aside from dataplane heavy workloads that can fully take advantage of SMT, I feel like it might be a too close for comfort race for Marvell. For the consumer and enterprises, it’s exciting either way as this means we’ll have a ton of viable options in the near future.

Related Reading:

Triton CPU Core - 30% Generational IPC Improvements
POST A COMMENT

27 Comments

View All Comments

  • Spunjji - Wednesday, August 19, 2020 - link

    Good to know your opinions on the future of the CPU market are just as balanced, nuanced and well-informed as your political ramblings... Reply
  • Quantumz0d - Wednesday, August 19, 2020 - link

    Better go back to your twitter and reeesetera and put more pronouns.
    And you don't even have any argument, you are stuck on that political comment. And you will be stuck there forever.
    Reply
  • Gomez Addams - Wednesday, August 19, 2020 - link

    Your diatribe entirely missed the point of why people are moving to ARM-based processors for servers and other purposes. In can be summarized in two words : power consumption. Server farms can save a lot of money using ARM processors compared to the equivalent horsepower from just about any other processor available. They are not moving to them for any performance advantage. Reply
  • Wilco1 - Thursday, August 20, 2020 - link

    Lower power also means less cooling, fewer power supplies and higher density. Additionally Arm servers need less silicon area to get the same performance, so upfront cost of server chips is lower too (you avoid paying extortionate prices like you do for many x86 server chips). Reply
  • eek2121 - Tuesday, August 18, 2020 - link

    The x86 market is not shrinking. This server offers no benefits over a modern AMD or Intel server. Reply
  • name99 - Tuesday, August 18, 2020 - link

    Two years ago you could reasonably have said "there is no plausible ARM server".
    A year ago you could legitimately have said "sure, there are ARM servers (TX2, Graviton) but they suck".
    This year the best you can say is "they offer no benefit over a modern AMD or Intel server" (actually already not true if you're buying compute from AWS).

    You want to bet against this trajectory?

    Next year? This was the year of matching x86 along important dimensions. Next year will be the year of exceeding x86 along important dimensions. Not ALL dimensions, that might take till 2022 or so, (there's still a reasonable amount of foundational work to do by all parties, like implementing SVE/2) but, as I said, the trajectory is clear.
    Reply
  • Spunjji - Wednesday, August 19, 2020 - link

    I have to agree with this assessment. People keep counting ARM designs out because they've taken a long time to ramp up to this level, but every year they get closer to being a notable force in the market, and every year the naysayers find another, smaller reason to point to for they'll never be successful.

    The simple truth is that ARM designs don't even have to beat x86 to take a slice of the market - they just have to offer *something*, be it cost benefits, lower idle power, improved security, or even just being an in-house design (a-la Amazon).
    Reply
  • Spunjji - Wednesday, August 19, 2020 - link

    Your last statement doesn't follow from - or lead to - the first one. Reply
  • Spunjji - Wednesday, August 19, 2020 - link

    Shrinking? Sure, eventually. Fast? Not so sure.

    AWS transitioning makes sense for their own use, but they'll more than likely need to continue offering x86 for customers. Same goes for Google and Microsoft. Hard to predict how that will shake out at this juncture.

    Apple aren't even close to 10% of the total x86 market, either - they're between 7.5% and 8% of the global *PC* market, which obviously doesn't include the server / datacentre market. That's still going to be a bit of a dent for Intel when the transition completes, but it's not nearly as bad for x86 on the whole as you're implying.

    Competition is heating up, though. That's a good thing.
    Reply
  • Gomez Addams - Tuesday, August 18, 2020 - link

    This illustrates why I think Nvidia wants to own Arm. They have already stated they are porting CUDA to the ARM instruction set. I think this is because they want to make a processor suited for HPC and it will be ARM-based and here's why. First, think of how their GPUs are organized. They use streaming multiprocessors with a whole bunch of little cores. These days they have 64 cores per SM so that is essentially 64-way SMT. The thing is these cores are very, very simple with many limitations. I think they want to use ARM-based cores with something like 16-way SMT. If they use AMD's multi-chip approach they could make an MCM with a thousand ARM cores in one package. There would be no CPU-GPU pairing as we often see today. One MCM could run the whole show. This would entirely eliminate the secondary data transfers to a co-processor and make for an incredibly fast super computer with relatively low power consumption. I think this architecture would be a huge improvement over what they have. Reply

Log in

Don't have an account? Sign up now