Testing the Cortex-X2: A New Android Flagship Core

The Cortex-X2 improves on the Cortex-X1 by moving to the Armv9 architecture and increasing core resources, and both Arm and Qualcomm are keen to promote it as offering better performance and responsiveness than previous CPU cores. The small frequency bump from 2.85 GHz to 3.00 GHz accounts for some of that performance; the question, as always, is whether the new manufacturing process coupled with the frequency increase allows for better power efficiency when running these workloads. Our standard analysis tool here is SPEC2017.

Running through some of these numbers, there are healthy gains to the core, and almost everything has a performance lift.

On the integer side (from 500.perlbench to 557.xz), there are good gains for gcc (+17%), mcf (+13%), xalancbmk (+13%), and leela (+14%), leading to an overall +8% improvement. Most of these integer tests involve cache movement and throughput, and gains in sub-tests like gcc usually help a wide range of regular user workloads.

Looking at power and energy for the integer benchmarks, we’re seeing the X2 consume more instantaneous power on almost all the tests, but the efficiency is kicking in. That overall 8% performance gain is taking 5% less total energy, but on average requires 2% more peak power.
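As a rough cross-check of these integer figures, the relationship between performance, energy, and average power can be sketched in a few lines of Python. This is only a sketch under stated assumptions: it takes the rounded +8% and -5% figures from the text, assumes runtime scales inversely with the SPEC score, and the function names are illustrative rather than part of any test harness.

```python
# Relative figures quoted in the text (X2 vs. X1, SPECint2017 overall).
perf_ratio = 1.08    # +8% performance
energy_ratio = 0.95  # 5% less total energy


def perf_per_energy(perf: float, energy: float) -> float:
    """Work done per joule, relative to the baseline core."""
    return perf / energy


def average_power(perf: float, energy: float) -> float:
    """Average power relative to baseline.

    Power = energy / runtime, and runtime is assumed to be 1 / perf,
    so the ratio works out to energy * perf.
    """
    return energy * perf


efficiency_gain = perf_per_energy(perf_ratio, energy_ratio) - 1.0
power_change = average_power(perf_ratio, energy_ratio) - 1.0

print(f"perf per joule: {efficiency_gain:+.1%}")
print(f"average power:  {power_change:+.1%}")
```

With these rounded inputs the sketch gives about +14% more work per joule and a little under +3% average power, which is roughly in line with the ~2% power increase quoted, allowing for rounding.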

If we put this core up against all the other performance cores we test, we see that 8% jump in performance for 5% less energy used, and the X2 stands well above the X1 cores of the previous generation, especially those in non-Snapdragon processors. There is still a fundamental step needed to reach the Apple cores, even the previous-generation A14 performance core, which scores 34% higher for the same energy consumed (albeit on average another 34% peak power).

Just on these numbers, Qualcomm’s claims of +20% performance or +30% efficiency aren’t borne out, but the floating point numbers are significantly different.

Several benchmarks in 2017fp are substantially higher on the X2 this generation. +17% on namd, for example, would point to execution performance increases, but +28% in parest, +41% in lbm and +20% in blender showcase a mix of execution performance and memory performance. Overall we’re seeing +19% performance, which is nearer Qualcomm’s 20% mark. Note that this comes with an almost identical amount of energy consumed relative to the X1 core in the S888, with a difference of just 0.2%.

The major difference, however, is the average power consumed. For example, our biggest single test gain, in 519.lbm, is +41%, but where the S888 averages 4.49 watts, the new X2 core averages 7.62 watts. That’s a 70% increase in instantaneous power consumed, and realistically no single core in a modern smartphone should draw that much power. The power goes this high because lbm leans on the memory subsystem, especially the 6 MiB L3 cache and the 4 MiB system level cache, all of which consume power. Overall in the lbm test, the +41% performance costs +20% energy, so efficiency is still +16% in this test. Some of the other tests, such as parest and blender, also follow this pattern.
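The lbm power and energy figures can be tied together with energy = average power times runtime. The following is a minimal Python sketch, assuming the SPEC score scales inversely with runtime and using only the rounded numbers quoted in the text; the variable names are illustrative.

```python
# Average power quoted in the text for 519.lbm, in watts.
s888_x1_watts = 4.49
sd8g1_x2_watts = 7.62
perf_ratio = 1.41  # +41% performance, so the run takes 1/1.41 of the time

power_ratio = sd8g1_x2_watts / s888_x1_watts
runtime_ratio = 1.0 / perf_ratio
energy_ratio = power_ratio * runtime_ratio  # energy = average power x runtime

print(f"average power: {power_ratio - 1:+.0%}")
print(f"energy:        {energy_ratio - 1:+.0%}")
```

With these inputs the sketch lands on roughly +70% average power and +20% energy, matching the figures above: the shorter runtime offsets part of the higher power draw, but not all of it.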

Comparing against the competition, the X2 core does make a better generational jump when it comes to floating point performance. It will be interesting to see how other processors implement the X2 core, especially MediaTek’s flagship, which runs at a slightly higher frequency on TSMC N4. If it also has access to a full 14 MiB combination of caches, as we suspect, that could push the power draw during single core use a lot higher. It will be difficult to tease out exactly who wins what where based on implementation vs. process node, but it will be a fun comparison to make when we look purely at X2 vs. X2 cores.

Unfortunately, due to how long SPEC takes to run (1h30 on the X2), we were unable to test the A710/A510 cores. We’ll have to wait until we get a retail unit.

174 Comments

  • Alistair - Tuesday, December 14, 2021

    Their GPU is great, their CPU is 3 years behind now, and this improvement over last year is almost nil. Sigh. Sad.
  • Raqia - Tuesday, December 14, 2021

    I'm looking forward to the '23 Nuvia designed cores for laptop compute, so let's see what they can do.

    However, I think it's perfectly fine if they went with a smaller ARM solution for future phone SoCs: in the S8G1 and the 888 they consciously chose to keep L3 cache sizes in their CPU complex (and hence single threaded performance) below the biggest possible for 2 generations, to dedicate die area and power consumption to other higher impact purposes. To me, an ideal Apple phone would have 6 of their small cores so they can dedicate more die area to their GPU, NPU and ISPs.

    John Carmack himself thought it best to throttle the CPU to half of maximum clock speed in even the XR2 (which is an 865 derivative using a faster clocked, bigger cache A77 as its biggest core):

    https://twitter.com/ID_AA_Carmack/status/130662113...
    https://twitter.com/ID_AA_Carmack/status/131878675...

    but he praised the XR2 in no uncertain terms, calling it "a lot of processing"

    https://youtu.be/sXmY26pOE-Y?t=1972

    It is indeed the DSPs and the GPUs doing the heavy lifting in the VR use case; I don't see it being much different for phones where wireless data rates are by far the biggest bottleneck.

    The CPU benches you see headlining many web SoC reviews matter only for the benchmark obsessed, but pretty much no one else.
  • Alistair - Tuesday, December 14, 2021

    Your throttling argument doesn't make sense when the iPhone is more efficient also. You can run an iPhone at Snapdragon speed, and then you use way less power.
  • Raqia - Tuesday, December 14, 2021

    If you look at the efficiency curves for the A77 and the A13's big CPU, they're pretty danged close:

    https://miro.medium.com/max/1155/1*U7qA0vDhixGAYes...

    The bigger point is, for phones it's well past the point of diminishing returns to pin the A13 CPU where most benchmarks do since it's simply not a bottleneck in realistic workloads. You can make a very fast CPU for bench-marketing purposes and get semi-technical people excited about your SoC, but you won't need to go very far along the curve to both hit its "knee" and have excellent performance.

    Apple made the core for laptops and desktops (for which it's well suited) but included it in its iPhone for marketing purposes rather than to address actual performance needs. Some cite the fact that more apps are coded in Javascript and websites are more Javascript intensive these days, but by far the bigger culprit in responsiveness is data connectivity and they were happy to use Intel's inferior modems behind the scenes while trotting out big but irrelevant Geekbench scores. Furthermore, part of their battery-gate issue stems from the huge possible current draw of their CPUs, which while efficient still use high peak power and current.

    Qualcomm has certainly been worse in efficiency and performance across multiple SoC processing blocks for the past two generations due to switching to Samsung as its premium SoC fab, and I certainly have no kind words for them in making that decision. However, given what they had to work with in terms of die area and power draw, they did make the correct decision in de-emphasizing the CPU block for relatively more grunt in the other blocks.
  • ChrisGX - Thursday, December 16, 2021

    Yes, that's right, but Samsung's inadequate process nodes are primarily responsible for Snapdragon parts (and all premium mobile SoCs based on licensed ARM IP) falling further behind. (Note: ARM SoCs are still seeing notable improvements in the execution rate of floating point workloads even as integer performance wallows.) For that reason, it will be very interesting to see how the TSMC fabbed MediaTek Dimensity 9000 acquits itself.

    The more telling part of this story, I think, is the failure of ARM and ARM licensees to manage this transition to high performance mobile SoCs while maintaining energy efficiency leadership. In the mobile phone world, today, Apple not only wears the performance crown but the energy efficiency crown as well.
  • Wilco1 - Saturday, December 18, 2021

    There are claims that Dimensity 9000 has ~49% better perf/W than SD8gen1: https://www.breakinglatest.news/business/tsmcs-4nm...

    That means the efficiency gap was indeed due to process as suspected. There is definitely an advantage in using the most advanced process 1 year before everyone else.
  • Raqia - Saturday, December 18, 2021

    Really good to see them pick up their game: bigger L3 cache and faster clocked middle cores seem to be part of the reason efficiency and multicore performance are up as well aside from process.

    Some rumors indicate the dual sourced version of the S8G1 (SM8475) may be more efficient than the Samsung node fabbed version but not as much as expected. It seems like Qualcomm picks different sub-blocks to optimize with each generation: this gen it was most certainly the GPU. Looks like the CPU block can be expected to languish until they bring up the NUVIA designed cores likely in '24. As their initial focus was servers, NUVIA may not have had a suitable small core in the pipeline for '23 which is much more important for mobile than laptop scale devices.
  • Wilco1 - Sunday, December 19, 2021

    Yes it looks like MediaTek have done a great job. The larger caches should help power efficiency as well indeed. It will be interesting to see how the larger L3 and system cache compare with the Snapdragon and Exynos in AnandTech's benchmarks.
  • Kamen Rider Blade - Tuesday, December 14, 2021

    I wonder how much more performance Android would gain by going with C++ instead of Java.

    https://benchmarksgame-team.pages.debian.net/bench...

    There's A LOT of performance to be gained by going with C/C++/Rust.

    The fact that Android went with Java for its primary programming language while Apple went with a C/C++ derivative could be what explains the large gap.
  • jospoortvliet - Wednesday, December 15, 2021

    Might make a difference in day to day use but not in these benchmarks as they already use native code.
