Hot Chips 2020: Marvell Details ThunderX3 CPUs - Up to 60 Cores Per Die, 96 Dual-Die in 2021by Andrei Frumusanu on August 17, 2020 4:30 PM EST
- Posted in
- Enterprise CPUs
SMT4 Details - Four Threads Per Core
One of the things that makes the Thunder line-up stand out from the competition is its inclusion of 4-way SMT, meaning that each core can execute up to 4 threads.
Each thread is viewed from the OS as a fully independent CPU and each has its own independent Arm architecture state, sharing the vast majority of the core’s resources bar a very few exceptions such as the aforementioned Skid Buffer.
The microarchitecture had always been multi-threaded, but Marvell went ahead to re-account for the area impact of SMT and discloses that it only takes about 5% of a core.
The company further details some of the mechanisms of its SMT, such as its arbitration mechanism between threads. During the fetch stage for example, the core will pick the thread which currently has the least amount of instructions live in the core’s pipelines, ensuring that the number of micro-ops and instructions further down the pipeline are balanced between the threads. We see a similar logic in the dispatch stage, and the thread with the fewest instructions downstream in the pipeline is picked out of the Skid Buffer.
The back-end has no notion of threads and simply executes the micro-ops which are oldest first. Retiring happens with priority in regards to the threads that have most backed up instructions for retiring.
Marvell says that this thread arbitrations works quite well on most codes, with the execution latencies between threads being quite uniform.
The speed-up that SMT can bring to the table is reversely correlated with the IPC of a given workload, meaning that a low IPC workload will see the biggest improvements with SMT. Other way to describe this is data-plane centric workloads which have a high latency to data fetching for execution are better suited to hide these kind of bottlenecks and idle-periods of the core through SMT.
Low-IPC workloads such as databases see a quite big gain in IPC and performance reaching up to 2x for 4 threads. Higher IPC workloads with a smaller data footprint will see more limited benefit to IPC.
Translating this to socket-level performance, we see a great scaling up to 60 cores which is essentially the physical core count of the processor, and a more sub-linear, but still quite respectable scaling up to 240 threads. Performance from 60 to 240 threads increases by roughly 60% which is a nice gain considering the very low area impact of SMT4 on Marvell’s cores.
When asked about how its ThudnerX3 is positioned against the competition, Marvell says that against Intel based products the company will slightly lag behind in single-thread performance, but will offer vastly greater multi-threaded throughput. Against AMD (we assume Rome), the TX3 is said to perform better in single-threaded performance with AMD taking the lead in workloads with low data sharing, although the TX3 to do better in workloads with more data-sharing such as database applications. Graviton2 is said to be a very good chip, although it offered low frequency and no threading support, so those are the areas the TX3 would be better in.
Overall, the TX3 seems like a solid candidate in the current server space, however I don’t feel like it differentiates itself very much aside from the fact that it offers SMT support. I feel like the CPU’s microarchitecture is still quite narrow, and although the IPC improvements are generationally good, Marvell also has significantly longer time between releases than Arm. In that regard, only slightly beating the Graviton2 here doesn’t seem enough and I do expect Altra-based designs to be faster.
We’ll have to see how the ThunderX3 ends up in terms of performance and power efficiency, but aside from dataplane heavy workloads that can fully take advantage of SMT, I feel like it might be a too close for comfort race for Marvell. For the consumer and enterprises, it’s exciting either way as this means we’ll have a ton of viable options in the near future.
- Hot Chips 2020 Live Blog: Marvell ThunderX3 (10:30am PT)
- Marvell Announces ThunderX3: 96 Cores & 384 Thread 3rd Gen Arm Server Processor
- Assessing Cavium's ThunderX2: The Arm Server Dream Realized At Last
- Marvell Unveils its Comprehensive Custom ASIC Offering
- Next Generation Arm Server: Ampere’s Altra 80-core N1 SoC for Hyperscalers against Rome and Xeon
- Ampere’s Product List: 80 Cores, up to 3.3 GHz at 250 W; 128 Core in Q4
- Amazon's Arm-based Graviton2 Against AMD and Intel: Comparing Cloud Compute
Post Your CommentPlease log in or sign up to comment.
View All Comments
Tomatotech - Monday, August 17, 2020 - linkSMT4 is interesting, 60% speedup from 5% area. Makes me think SMT8 might just be worth exploring in the next generation, even if it is only 30% speedup from 10% area.
It’s been a while since I studied this, but saying it has 90MB L3 cache might be regarded as misleading. If you try stuffing your 50MB of code and data into cache, it’s going to fall over quite spectacularly. Each 4-core tile has only 6MB cache, so you have to keep your code and data to only 6MB or less if you want it to fit into cache. (Or fiddle around with splitting it between tiles). Better to say it has 15x6MB L3 cache (Still damn good) and leave it at that?
saratoga4 - Monday, August 17, 2020 - link>Each 4-core tile has only 6MB cache
According to the slides linked above the L3 cache lines have no affinity for any specific core, so it really is a single 90MB L3, similar to how Intel's ring bus parts work.
Tomatotech - Monday, August 17, 2020 - linkYou have to be careful about the marketing wording. What you said is what they want you to think - that any core can use any cache equally well - but a close reading indicates it could mean just 'any of the 4 cores in a single tile can use any of the 2x3MB caches associated with that tile'. Hence my use of 6MB.
Even worse - a single core might not be able to use all of both on-tile caches at the same time, so the real max cache for a core might be as low as 3MB. This chip is designed to be a multi-thread monster with up to 240 threads per die, each thread accessing its own part of cache. Not to have a tiny number of threads using up all the cache. That's a different layout.
saratoga4 - Monday, August 17, 2020 - link>but a close reading indicates it could mean just 'any of the 4 cores in a single tile can use any of the 2x3MB caches associated with that tile'.
The slides say that individual tiles are striped, so I don't think your reading is correct.
Krysto - Tuesday, August 18, 2020 - linkPower10 is already doing SMT8.
GreenReaper - Wednesday, August 19, 2020 - linkIt talks about the increase in area, but not power usage. Moreover you risk effectively reducing single-threaded speed unless you upgrade every other bit (and probably even then). If you limit them further it starts to look more and more like a GPU.
I think we need more evidence that people can use SMT4 effectively, as they have done eventually with Quad-Core and beyond, before it is increased further.
senttoschool - Tuesday, August 18, 2020 - linkBoth AMD and Intel are in trouble. The x86 market is shrinking fast.
AWS is deadset on transitioning to its own ARM server chips so they can reduce cost, build chips for their own needs, and have unique features. Google Cloud and Microsoft Azure will probably follow shortly in order to compete.
With Apple's transition to ARM, the x86 market shrank by 10% overnight. In addition, MacOS ARM will spur a renewed push for developers to optimize for ARM (Windows or Mac) on the laptop/desktop. This means Windows ARM will probably be an extremely viable option in the near future.
The x86 market isn't dead. But it's shrinking. AMD and Intel aren't just competing against each other, they're also competing with Apple, Qualcomm, Marvell, ARM, Nuvia, Ampere, Samsung, Huawei, Alibaba, etc.
Quantumz0d - Tuesday, August 18, 2020 - linkLOL
ARM is always stupid custom, Apple trash is the least bothered part for Apple themselves as their Mac share of revenue is under 10%, that's why Apple did the move because instead of paying Intel and getting caught in the cheap VRM trash news they would benefit from the new hybrid OS of Mac which is an abomination to desktop UX. Apple transition shranked x86 marketshare overnight ? hahah, like Apple said themselves it's going to be a 2 year period so the products are still there. By that time you think AMD and Intel both DC centric companies sit and lay eggs ? ARM is dogshit, it's always fucking custom the massive support of x86 in Linux space is not there for ARM so transitioning to that is not possible at this point of time, x86 already is a RISC underneath it, so your yapping of Intel and AMD in trouble ain't true. They sure have to make sure they are innovating, Intel is going Big Little for the mobile space, AMD patented the same and they will go the same, and with massive Windows ecosystem this ARM bullshit failed on RT and x86 to ARM translation is not for free, you get performance penalty. Apple pays top companies like Adobe to make their Software like first party IP, it's not a major market.
AT spec scores out of that graphs are meaningless. When the work done by Snapdragon processors is equivalent to what A series do, just because the OS feels snappy due to tight integration and closed binaries is not a measure but real life work performance.
All those companies, none of them make Datacenter processors except Marvell and with Apple being only consumer centric company what the fuck are you blabbing here ? Datacenter market is owned by x86 not ARM, and Qualcomm left after pouring billions in their Centriq prized custom ARM IP where Cloudflare was heralding just like how again CF is now doing it again with Altera. AWS is going Graviton because it's always Amazon's one of their own type in house stuff to save money and it's not going to beat AMD which is the king of the DC with EPYC 7742. Icelake SP is also coming and Intel already probably locked out vendors from moving to AMD by now forget ARM.
ARM is again, full custom. Qualcomm is the only one which properly respects the OSS by CAF and Binaries, Blobs etc on Android space. Samsung is a failure, Huawei is utter trash in that dept. Nuvia is vaporware until real product and that's only for DC market, not Smartphones or ultra portables, which is where lot of money is there to start up quick. And on Windows it's utter trash and same for Linux, only low level workloads. Name one ARM processor based machine which can run anything like x86 Wintel machines ?
HVAC - Tuesday, August 18, 2020 - linkUtter ... trash ...
Greys - Wednesday, August 26, 2020 - linkI am agree with you!