The Cortex-A77 µarch: Added ALUs & Better Load/Stores

Having covered the front-end and middle-core, we move on to the back-end of the Cortex-A77 and investigate what changes Arm has made to the execution units and data pipelines.

On the integer execution side of the core we’ve seen the addition of a second branch port, which goes along with the doubling of the branch-predictor bandwidth of the front-end.

We also see the addition of another integer ALU. This new unit sits halfway between a simple single-cycle ALU and the existing complex ALU pipeline: it naturally still handles single-cycle ALU operations, but it can also execute the more complex 2-cycle operations (some shift-combination instructions, logical instructions, move instructions, and test/compare instructions). Arm says that adding this pipeline brought a surprising amount of performance uplift: as the core gets wider, the back-end can become a bottleneck, and this was a case of the execution units needing to grow along with the rest of the core.

A larger change in the execution core was the unification of the issue queues. Arm explains that this was done in order to maintain efficiency of the core with the added execution ports.

Finally, the existing execution pipelines haven’t seen many changes. One latency improvement was the pipelining of the integer multiply unit on the complex ALU, which allows it to achieve 2-3 cycle multiplications as opposed to 4.

Oddly enough, Arm didn’t make much mention of the floating-point / ASIMD pipelines for the Cortex-A77. Here it seems the A76’s “state-of-the-art” design was good enough for them to focus the efforts elsewhere on the core for this generation.

On the load/store side, we still find two units; however, Arm has added two dedicated store ports to them, which doubles the issue bandwidth. In effect, the L/S units are now 4-wide, with 2 address-generation µOps and 2 store-data µOps.
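To make the issue-bandwidth arithmetic concrete, here is a minimal toy model. It assumes every store is cracked into one address-generation µOp and one store-data µOp, and it compares a hypothetical baseline in which both kinds of µOp compete for the same two ports against the split arrangement described above. The function names and the baseline behaviour are illustrative assumptions, not something Arm has documented.

```python
from math import ceil


def issue_cycles_shared(loads: int, stores: int, ports: int = 2) -> int:
    """Cycles to issue when address and store-data µOps share the same ports.

    Hypothetical baseline: each store costs two µOps on the shared ports.
    """
    uops = loads + 2 * stores  # 1 address µOp per load, 2 µOps per store
    return ceil(uops / ports)


def issue_cycles_split(loads: int, stores: int,
                       agu_ports: int = 2, data_ports: int = 2) -> int:
    """Cycles to issue when store-data µOps have dedicated ports."""
    address_uops = loads + stores  # every load and store needs an address
    data_uops = stores             # only stores need a store-data µOp
    return max(ceil(address_uops / agu_ports), ceil(data_uops / data_ports))


if __name__ == "__main__":
    # A pure store stream benefits most from the dedicated ports.
    print(issue_cycles_shared(0, 8), issue_cycles_split(0, 8))
    # A mixed stream benefits less, since loads never use the data ports.
    print(issue_cycles_shared(4, 4), issue_cycles_split(4, 4))
```

Under these assumptions the doubling only shows up fully for store-heavy code; a load-only stream is limited by the two address-generation ports in both models.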

The issue queues themselves again have been unified and Arm has increased the capacity by 25% in order to expose more memory-level parallelism.
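As a rough illustration of why more queue capacity helps memory-level parallelism, the sketch below treats cache misses as independent and serviced in waves, with at most `in_flight` misses overlapping at once. The entry counts (16 versus 20, a 25% increase) and the 200-cycle latency are hypothetical numbers chosen for the example; the actual A77 queue sizes are not given here.

```python
from math import ceil


def cycles_for_misses(misses: int, in_flight: int, latency: int) -> int:
    """Idealized cycles to service independent misses in overlapping waves.

    Assumes every miss takes `latency` cycles and that up to `in_flight`
    misses can be outstanding at the same time.
    """
    if in_flight <= 0:
        raise ValueError("in_flight must be positive")
    return ceil(misses / in_flight) * latency


if __name__ == "__main__":
    # Hypothetical capacities: 16 entries, then 25% more (20 entries).
    print(cycles_for_misses(80, 16, 200))
    print(cycles_for_misses(80, 20, 200))
```

Real workloads have dependent loads and variable latencies, so the gain in practice would be smaller than this idealized model suggests.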

Data prefetching is incredibly important for hiding a system’s memory latency: shaving off cycles by not having to wait for data can be a big performance boost. I tried to cover the Cortex-A76’s new prefetchers and contrast them against other CPUs in the industry in our review of the Galaxy S10. What stood out for Arm is that the A76’s new prefetchers performed outstandingly well and were able to deal with some very complex patterns. In fact, the A76 did far better than any other tested microarchitecture, which is quite a feat.

For the A77, Arm improved the existing prefetchers and added new prefetching engines to push this even further. Arm is quite tight-lipped about the details here, but we’re promised increased pattern coverage and better prefetching accuracy. One such change is claimed to be “increased maximum distance”, which means the prefetchers will recognize repeated access patterns over larger virtual memory distances.
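Since Arm has not disclosed how its prefetchers work, the following is only a textbook stride detector, used here to show what a maximum-distance limit could mean in principle. It assumes "distance" refers to the gap between consecutive accesses in a repeating pattern; the class name, the `max_distance` parameter, and the two-match trigger rule are all illustrative assumptions.

```python
from typing import List, Optional


class StridePrefetcher:
    """Toy stride detector: prefetch once the same stride is seen twice."""

    def __init__(self, max_distance: int) -> None:
        self.max_distance = max_distance        # largest stride tracked, in bytes
        self.last_addr: Optional[int] = None
        self.last_stride: Optional[int] = None

    def access(self, addr: int) -> Optional[int]:
        """Record an access; return an address to prefetch, or None."""
        prefetch = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Only trigger on a repeated, non-zero stride within the limit.
            if (stride != 0 and stride == self.last_stride
                    and abs(stride) <= self.max_distance):
                prefetch = addr + stride
            self.last_stride = stride
        self.last_addr = addr
        return prefetch


def count_prefetches(addrs: List[int], max_distance: int) -> int:
    """Count how many accesses in `addrs` trigger a prefetch."""
    pf = StridePrefetcher(max_distance)
    return sum(1 for a in addrs if pf.access(a) is not None)


if __name__ == "__main__":
    page_strided = [i * 4096 for i in range(10)]  # one access per 4 KiB
    print(count_prefetches(page_strided, max_distance=1024))
    print(count_prefetches(page_strided, max_distance=8192))
```

With the smaller limit the 4 KiB pattern is never picked up, while the larger limit catches it from the third access onward, which is the kind of coverage gain a larger maximum distance would bring in this simplified model.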

One new functional addition in the A77 is so-called “system-aware prefetching”. Here Arm is trying to solve the problem of a single IP having to work in many different systems; some systems have better or worse memory characteristics, such as latency, than others. To deal with this variance between memory subsystems, the new prefetchers change their behaviour and aggressiveness based on how the current system is behaving.

A thought of mine would be that this could signify some interesting performance improvements under some DVFS conditions – where the prefetchers will alter their behaviour based on the current memory frequency.

Another aspect of this new system-awareness is more knowledge of the cache pressure on the DSU’s L3 cache. If other CPU cores are highly active, the core’s prefetchers see this and scale down their aggressiveness to avoid needlessly thrashing the shared cache, which increases overall system performance.
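The two system-aware behaviours described above (reacting to memory latency and backing off under shared-cache pressure) can be sketched as a simple throttling policy. This is purely hypothetical: the function name, the linear scaling, the 100 ns reference latency, and the 0-to-1 pressure metric are assumptions made for illustration, and Arm has not published how the A77 actually makes this decision.

```python
def prefetch_degree(base_degree: int, mem_latency_ns: float, l3_pressure: float,
                    ref_latency_ns: float = 100.0, max_degree: int = 8) -> int:
    """Pick how many lines ahead to prefetch (hypothetical policy).

    base_degree    -- degree used at the reference latency with an idle L3
    mem_latency_ns -- currently observed memory latency
    l3_pressure    -- 0.0 (idle shared cache) to 1.0 (heavily contended)
    """
    if not 0.0 <= l3_pressure <= 1.0:
        raise ValueError("l3_pressure must be in [0, 1]")
    # Assumption: higher latency means running further ahead to hide it.
    scaled = base_degree * (mem_latency_ns / ref_latency_ns)
    # Assumption: a busy shared cache means backing off to limit thrashing.
    scaled *= 1.0 - l3_pressure
    # Clamp to a sane range; always keep at least one line of prefetch.
    return max(1, min(max_degree, round(scaled)))


if __name__ == "__main__":
    print(prefetch_degree(4, 100.0, 0.0))   # reference system, idle L3
    print(prefetch_degree(4, 150.0, 0.0))   # slower memory, idle L3
    print(prefetch_degree(4, 100.0, 0.75))  # other cores hammering the L3
```

The same structure would cover the DVFS case speculated on above: a change in memory frequency shows up as a change in observed latency, and the degree follows it.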


108 Comments

  • Retycint - Wednesday, May 29, 2019 - link

    This isn't 2012 anymore. A 30% better performance (for instance) isn't going to lead to any real world differences, especially given the fact that most consumers use their phones as a camera/social media machine
  • jackthepumpkinking6sic6 - Thursday, May 30, 2019 - link

    How foolish to actually sit there and act as if that's the only option. First of not only are those not the only high end option but they clearly said lower cost. Meaning any segment. Even mid and low range. Use your brain before commenting.
    Not to mention that despite being similarly priced and having insignificantly different benchmark scores those devices are overall better and more worthy the price.... Though none are worthy of such prices. Just some are more worth it than others.
  • alysdexia - Monday, December 30, 2019 - link

    performance -> speed
    Anandtech never explain how they get their power figures; I saw one mention of regression testing under the iPhone XS review but still no work. The figures look more like shared power or peak power than average CPU power as they conflict with general runtime or battery drain tests which suggest 2 watts sustained; I recently took Notebookcheck's loads of power figures to revise my list of the thriftiest CPUs where I found the equivalent TDP somewhere between the load and idles and implied screen, GPU, and memory powers; another way to estimate is to subtract the nonCPU from the power adapter rating which for iPhones is 5W, screen 1W to 2W. I had to throw out Anandtech's SPEC2006 powers.

    Androids do not get thriftier chips; iPhones idle better than the average (Use the comparison tool under any Notebookcheck review) and their huge cache seems to save power. (iPhone 11 has 33MB vs. S10 5MB. This makes A13 over 4fold as good as 9820.)

    W CPU/(CPU+GPU): select core–unit CPUmark [mobile/60] (Gn/s) {Mp/s/10; LZMA-D Mp/s/10}, Geekbench 5, UserBenchmark Int 2019 Dec: /W, /$
    ~1·2: Cortex-A77 A13, [11145] (29·3), 1330–3422, : [~9288], ; , ; , Lightning-Thunder
    ~2: Cortex-A76 A12X, [12591] (45), 1114–4608, : [~6296], ; ~2304, ; , Vortex-Tempest
    ~1·7: Cortex-A76 A12, [8006] (27), 1111–2869, : [~4709], ; ~1688, ; , Vortex-Tempest
    ~1·5: Cortex-A73 A10X, [6475] (19·5), 832–2274, : [~4317], ; ~1516, ; , Hurricane-Zephyr
    ~2: Cortex-A75 A11, [7267] (26·3), 919–2372, : [~3634], ; ~1186, ; , Monsoon-Mistral
    ~2: Cortex-A76-A55 485, [4429], 767–2715, : [~2214], ; ~1358, ; , Kryo
    ·003: Cortex-M0+ 1.8V 64MHz, {9}, , : {2870}, ; , ; ,
    ~2: Cortex-A75-A55-M4 9820, [4298], 762–2148, : [~2149], ; ~1074, ; , Exynos
    ~2: Cortex-A76-A55 990, [4078], 761–2861, : [~2039], ; ~1431, ; , Kirin
    ~2·2: Cortex-A73 A10, [4748] (12·6), 744–1333, : [~1976], ; ~606, ; , Hurricane-Zephyr
    ·012: M14K, {29}, , : {2500}, ; , ; ,
    1·7: Cortex-A73-A53 280, [<3031] (), 387–1448, : [<1783], ; 852, ; , Kryo
    2? ~(40/53): Cortex-A53 625, [2604], 260–773, : [1725?], ; 387?, ; , Apollo
    2·5: Cortex-A75-A55 385, [3940] (), 514–2191, : [1576], ; 876, ; , Kryo
    ~2: Cortex-A53-M2 8895, [>3031] (), 373–1497, : [>1516], ; ~749, ; , Exynos
    ~2: Cortex-A57-A53 A9X, [3000] (10·5), 3097–5284, : [~1500], ; , ; , Twister
    ~1·5: Cortex-A53-A57 A9, [2200] (12·7), , : [~1467], ; , ; , Twister
    ·001: Cortex-M0+ SAM L21 12MHz, {2}, : {1929}, ; , ; ,
    ·0033: Cortex-M0 1.8V 50MHz, {6}, , : {1920}, ; , ; ,
    ·09: SH-X4, {172}, , : {1856}, ; , ; , SH-X4
    ·000825: Cortex-M4 STM32L412xx 8MHz, {15}, , : {1852}, ; , ; ,
    ·013: Cortex-M3, {24}, , : {1790}, ; , ; ,
    ·15: 24K, {235}, , : {1600}, ; , ; ,
    ~1·8: Cortex-A57-A53 A8X, [2032] (10·6), , : [~1129], ; , ; , Typhoon
    ·15: 34K, {232}, , : {1455}, ; , ; ,
  • alysdexia - Monday, December 30, 2019 - link

    dammit can't edit:

    15 ~3/5: i5-1035G7, : ~, Ice Lake
    15 ~11/14: i7-10710U, 13107: ~1112, 30 Comet Lake
    15 ~16/19: i5-10210U, : ~, Comet Lake
  • Findecanor - Monday, May 27, 2019 - link

    I wonder what the difference would be if ARM removed AArch32 support like Apple did.
  • RSAUser - Tuesday, May 28, 2019 - link

    We might see it with next gen as Google is dropping 32bit app support on the Play Store. If there is a performance advantage/cost or power saving, they'll probably implement it.
  • beginner99 - Tuesday, May 28, 2019 - link

    However with android you also get the performance at half the price. What these charts don't show is the actual size of the chip and apple SOCs are large and hence more expensive.
  • michael2k - Thursday, May 30, 2019 - link

    ARM isn't competing with Apple and doesn't need to compete with Apple.

    ARM licenses out its implementation to those who also cannot compete with Apple.

    You get the benefit, as compensation for lower performance, of a cheaper phone.

    If you really want something that is as high performance you're going to have to buy from a company willing to invest the money into designing such a CPU, and that requires a different budget than licensing ARM's SoC/CPU designs.
  • Meteor2 - Monday, June 3, 2019 - link

    ARM's cores matched Apple for efficiency with A76 (a breakthrough design), alongside sufficient performance.

    With the large increases in IPC per generation we're seeing from ARM, I don't think it will be long before the absolute performance gap is closed either -- if SoC manufacturers choose to equal Apple's peak power draw. They may well not, and nobody will mind.
  • markiz - Tuesday, June 4, 2019 - link

    Out of curiosity, what is it that you want to do on your phone that you feel is too slow on e.g. S855 phone?
