Another year, another TechDay from Arm. Over the last several years Arm’s event has come as clockwork in the May timeframe and has every time unveiled the newest flagship CPU and GPU IPs. This year is no exception as the event is back on the American side of the Atlantic in Austin Texas where Arm has one of its major design centres.

Two years ago during the unveiling of the Cortex A73 I had talked a bit more about Arm’s CPU design teams and how they’re spread across locations and product lines. The main design centres for Cortex-A series of CPUs are found in Austin, Texas; Cambridge, the United Kingdom, and Sophia-Antipolis in the south of France near Nice. For the last two years the Cortex A73 and Cortex A75 were designs that mainly came out of the Sophia team while the Cortex A53 and more recently the A55 were designs coming out of Cambridge. This means that we haven’t seen any recent designs coming out of Austin and the last of the “Austin family” of CPUs were the A57 and A72.

The project being worked on in Austin had been hyped up for several years – I remember even as early as the A73 release back in 2016 the company had pulled forward some elements from an advanced future microarchitecture on the back-end pipelines, especially on the FP/SIMD side. The Cortex A75 was further remarked as pulling more elements from this new mysterious project.

Today we can finally unveil what the Austin team has been working on – and it’s a big one. The new Cortex A76 is a brand new microarchitecture which has been built from scratch and lays the foundation for at least two more generations for what I’ll call “the second generation of Austin family” of CPUs.

The Cortex A76 is important for Arm for a design perspective as it represents a new start from a clean sheet. It’s rare for IP claim to be able to do this as it represents a great resource and time investment and if it weren’t for the Sophia design team taking over the steering wheel for the last two generations of products it wouldn’t have been reasonable to execute. The execution of the CPU design teams should be emphasised in particular as Arm claims this is the 5th generation “annual beat” product where the company delivers a new microarchitecture every new year. Think of it as an analogue to Intel’s past Tick-Tock strategy, but rather Tock-Tock-Tock for Arm with steady CAGR (compound annual growth rate) of 20-25% every generation coming from µarch improvements.

So what is the Cortex A76? In Arm’s words, it’s a “laptop-class” performance processor with mobile efficiency. The vision of the A76 as a laptop-class processor had been emphasised throughout the TechDay presentation so it seems Arm is really taking advantage of the large performance boost of the IP to cater to new market segments such as the emerging “Always connected PCs” which Qualcomm is spearheading with their SoC platforms.

The Cortex A76 microarchitecture has been designed with high performance while maintaining power efficiency in mind. Starting from a clean sheet allowed the designers to remove bottlenecks throughout the design and to break previous microarchitectural limitations. The focus here was again maximum performance while remaining within energy efficiency that is fit for smartphones.

In broad metrics, what we’re promised in actual products using the A76 is the follows: a 35% performance increase alongside 40% improved power efficiency. We’ll also see a 4x improvements in machine learning workloads thanks to new optimisations in the ASIMD pipelines and how dot products are handled. These figures are baselined on A75 configurations running at 2.8GHz on 10nm processes while the A76 is projected by Arm to come in at 3GHz on 7nm TSMC based products.

The new CPU is naturally still compatible with DynamIQ’s common cluster topology and Arm envisions designs to be paired with Cortex A55s as the little more power efficient CPUs. The configuration scalability of the DynamIQ IP again was reiterated and we were presented with example configurations such as 1+7 or 2+6 with either Cortex A75 or A76 CPU IP. This presentation slide was one of the rare ones where Arm referred to the area size of the A76, pointing out that the A75 still had better PPA and thus might still be a valid design choice for companies, depending on their needs. One comparison that was made during the event is that in terms of area, three A76’s with larger caches would fit inside the size of a Skylake core – all while within 10% of the IPC of the Intel CPU, but obviously there’s also process node scaling considerations to take into account.

A standout claim is that Arm aims to outperform the competition at half the area and half the power. Arm was slightly beating around the bush here in what it considers the competition, but generally the answer was that it was considering everybody the competition. Taking into account Intel, AMD or Samsung it’s actually not that hard to imagine Arm beating them in PPA as historically the company always had the smallest CPU designs and that directly translates into more efficient microarchitectures.

Before we get into more detailed breakdowns of the performance and power improvements and what I’m expecting to happen into products, let’s see the microarchitectural improvements on the core and how Arm managed to extract this much performance while maintaining power efficiency.

Cortex A76 µarch - Frontend
Comments Locked

123 Comments

View All Comments

  • techconc - Tuesday, June 5, 2018 - link

    " A11 is no match in speed and performance versus SD845 and Exynos powered Android phones today"

    Huh? Benchmarks do not support your claim.
  • name99 - Friday, June 1, 2018 - link

    This assumption makes two mistakes.

    The first is to assume that ONE metric (in this case 4-wide front end) is the PRIMARY determinant of performance. Even Apple's (A11) IPC (over a wide range of code) is about maybe 2.7. This means on average less than 3 of those 6 execution units are being used per cycle. IF other parts of the core uncache could be PERFECTED on a 4 wide design so that EVERY cycle 4 instructions executed, it would clearly surpass the A11 in IPC.
    The problem, of course, is just how hard it is to prevent cycles where NOTHING executes and cycles where only a few (one or two) instructions execute. Reducing these are where most of the magic is --- and you won't see details of that it in an article like this; rather it's in that painstaking rooting out hundreds of small inefficiencies that the article talked about.

    To give just one example -- no-one is talking about the clustered page tables. This is a very cool idea which relies on the fact that most of the time the OS page allocator allocates a number of pages contiguously in virtual AND physical space, and with the same permissions. If that is so, the same page entry can correspond to multiple contiguous pages (in the academic literature, usually up to 8). This gives you a substantial increase in TLB reach at a very minor increase in TLB bits.
    (I can find no info as to whether Intel does this. I SUSPECT Apple used to do this in their earlier [and probably even A11] cores. There are very recent even better ideas in the academic literature that might, perhaps, have made it to the A12 core.)

    Second mistake you make is to ignore frequency. A9 ran at 1.85 GHz, A10 at 2.35 GHz. The A76 will likely run at 3GHz.
  • tipoo - Tuesday, September 4, 2018 - link

    And yet, here we are with single core results at around 60% that of the A11. Taking their own numbers at face value, a 56% increase over the A73 in GB4 results in 2800.

    Yes, I used a very simplistic one dimensional comparison, and there's a whole lot more to it. However, core complexity does go up almost exponentially with width, and so it does point to what ballpark they were aiming at. A76 was never going to beat the A11 per core because it was never aimed at it.
  • colinisation - Thursday, May 31, 2018 - link

    Hi Andrei,

    Is this core the one referred to as Ares on roadmaps?

    Been waiting years for this one if it is.
  • Andrei Frumusanu - Thursday, May 31, 2018 - link

    Yes in practical terms - no in actual terms. You'll likely hear more about this in the future.
  • tuxRoller - Friday, June 1, 2018 - link

    Ok, now you've incepted the idea that ARM is going to announce a dedicated server-class chip (maybe even a tease of SVE.....)
  • name99 - Friday, June 1, 2018 - link

    I agree with your point (ARM will release a server chip, essentially based on this core).
    Remember that GB4 is scaled to 4000 represents an i7-6600U (Skylake, 3.4GHz, 4MiB L3, 15W).
    So A76 is essentially at that level (slightly worse FP, but many server tasks will not care).
    To the extent that that Skylake at 3.4GHz is an acceptable Server class core, ARM could dump some large number of A76 on a die and be in the same sort of space as dearly-departed Centriq and ThunderX2.

    They likely would have to beef up their NoC one way or another, and tweak the caching and memory systems, the usual server additions.
    But I assume they didn't put all that effort into "lowest possible latency for hypervisor activity" on the theory that hypervisor performance on smartphones is THE next big thing...
  • joe_85 - Thursday, May 31, 2018 - link

    I am sure people will disagree with me because people love to argue. Andrei, I like your writing and overall thoroughness but a few critiques here. The charts you make are extremely unpleasant to look at and do not lend themselves to a quick assessment of the data.

    First of all the color coded stripes in the legend for the A76 projections is not even decipherable, and the actual bars are on the chart are difficult to see. Secondly, why are you color coding them at all? Just put processor names to the left of the bars and the benchmark name above the bars.

    Additionally are the bars in any particular order? If they are I certainly can't tell, they should be relative to the performance OR the efficiency.

    Other constructive criticism would be that adding some additional subheadings within your articles would make it feel like a more solid piece.

    Keep up the good work.
  • jospoortvliet - Wednesday, June 6, 2018 - link

    Loving to argue or not, the performance vs efficiency graphs are rather unique and very clever, I have not seen any design that so clearly shows how different SOC's compare at both. Yes it takes a few mins before you can read them I am sorry that the world is so complicated. But they work very well if you just use that gray matter a bit.
  • syxbit - Thursday, May 31, 2018 - link

    As an Android user, I continue to be disappointed with QCOMM, Arm (and Nvidia for dropping out) at how far ahead Apple in single threaded perf.

Log in

Don't have an account? Sign up now