Arm's New Cortex-A77 CPU Micro-architecture: Evolving Performance

Author: Andrei Frumusanu

by Andrei Frumusanu on May 27, 2019 12:01 AM EST

The Cortex-A77 µarch: Added ALUs & Better Load/Stores

Having covered the front-end and middle-core, we move on to the back-end of the Cortex-A77 and look at the changes Arm has made to the execution units and data pipelines.

On the integer execution side of the core we’ve seen the addition of a second branch port, which goes along with the doubling of the branch-predictor bandwidth of the front-end.

We also see the addition of a further integer ALU. This new unit sits half-way between a simple single-cycle ALU and the existing complex ALU pipeline: it naturally still handles single-cycle ALU operations, but it can also execute the more complex 2-cycle operations (some shift-combination instructions, logical instructions, move instructions, test/compare instructions). Arm says that adding this pipeline brought a surprising amount of performance uplift: as the core gets wider, the back-end can become a bottleneck, and this was a case of the execution units needing to grow along with the rest of the core.

A larger change in the execution core was the unification of the issue queues. Arm explains that this was done in order to maintain efficiency of the core with the added execution ports.

Finally, the existing execution pipelines haven’t seen many changes. One latency improvement was the pipelining of the integer multiply unit on the complex ALU, which allows it to achieve 2-3 cycle multiplications as opposed to 4.
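To put the multiply latency change into perspective, here is a back-of-the-envelope sketch of a chain of dependent multiplies, where each one has to wait for the previous result. The function name and the chain length of 100 are illustrative assumptions of mine, and the model ignores everything else going on in the core; it only shows that a dependency-bound chain scales linearly with the multiply latency.

```python
def dependent_chain_cycles(num_multiplies: int, latency_cycles: int) -> int:
    """Cycles to finish a chain where each multiply consumes the previous result.

    Hypothetical simplification: no overlap with other work, no issue stalls.
    """
    if num_multiplies < 0 or latency_cycles <= 0:
        raise ValueError("need a non-negative chain length and a positive latency")
    return num_multiplies * latency_cycles


if __name__ == "__main__":
    chain_length = 100  # illustrative value, not taken from Arm
    for latency in (4, 3, 2):
        cycles = dependent_chain_cycles(chain_length, latency)
        print(f"{latency}-cycle multiply: {cycles} cycles for {chain_length} dependent multiplies")
```

Under these assumptions, going from 4 to 3 cycles cuts such a chain by a quarter, and going to 2 cycles halves it. Independent multiplies that can overlap in the pipeline would see a much smaller difference.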

Oddly enough, Arm didn’t make much mention of the floating-point / ASIMD pipelines for the Cortex-A77. Here it seems the A76’s “state-of-the-art” design was good enough for them to focus the efforts elsewhere on the core for this generation.

On the load/store side, we still find two units; however, Arm has added two dedicated store ports to them, which in effect doubles the issue bandwidth. This means the L/S units are now 4-wide, with 2 address generation µOps and 2 store data µOps.

The issue queues here have again been unified, and Arm has increased their capacity by 25% in order to expose more memory-level parallelism.

Data prefetching is incredibly important for hiding the memory latency of a system: shaving off cycles by not having to wait for data can be a big performance boost. I tried to cover the Cortex-A76’s new prefetchers and contrast them against other CPUs in the industry in our review of the Galaxy S10. What stood out for Arm is that the A76’s new prefetchers performed outstandingly well and were able to deal with some very complex patterns. In fact, the A76 did far better than any other tested microarchitecture, which is quite a feat.

For the A77, Arm improved the existing prefetchers and added new prefetching engines to improve this even further. Arm is quite tight-lipped about the details here, but we’re promised increased pattern coverage and better prefetching accuracy. One such change is claimed to be “increased maximum distance”, which means the prefetchers will recognize repeated access patterns over larger virtual memory distances.
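Since Arm doesn't disclose how its prefetchers work, the following is only a toy model of what a larger maximum distance could mean for a simple stride detector. The `StridePrefetcher` class, the `covered_accesses` helper, the `max_distance` parameter and the byte values are all assumptions of mine for illustration, not Arm's design.

```python
from typing import Iterable, List, Optional


class StridePrefetcher:
    """Toy stride detector (hypothetical model, not Arm's implementation)."""

    def __init__(self, max_distance: int, degree: int = 1) -> None:
        self.max_distance = max_distance  # largest stride in bytes the detector tracks
        self.degree = degree              # how many strides ahead to prefetch
        self.last_addr: Optional[int] = None
        self.last_stride: Optional[int] = None

    def access(self, addr: int) -> List[int]:
        """Record a demand access and return the addresses to prefetch."""
        prefetches: List[int] = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and abs(stride) <= self.max_distance:
                # Prefetch only once the same stride has been seen twice in a row.
                if stride == self.last_stride:
                    prefetches = [addr + stride * i for i in range(1, self.degree + 1)]
                self.last_stride = stride
            else:
                # Stride is zero or beyond the tracking window: forget the pattern.
                self.last_stride = None
        self.last_addr = addr
        return prefetches


def covered_accesses(prefetcher: StridePrefetcher, addresses: Iterable[int]) -> int:
    """Count demand accesses whose address had already been prefetched."""
    pending = set()
    hits = 0
    for addr in addresses:
        if addr in pending:
            hits += 1
        pending.update(prefetcher.access(addr))
    return hits


if __name__ == "__main__":
    trace = [i * 8192 for i in range(10)]  # ten accesses, 8 KB apart
    print(covered_accesses(StridePrefetcher(max_distance=4096), trace))
    print(covered_accesses(StridePrefetcher(max_distance=16384), trace))
```

In this sketch, a detector limited to 4 KB strides covers none of the ten accesses, while one that tracks strides up to 16 KB covers seven of them once it has locked onto the pattern. Real prefetchers track many streams and far more complex patterns, so this only illustrates the general idea.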

One new functional addition in the A77 is so-called “system-aware prefetching”. Here Arm is trying to solve the issue of a single IP having to be used in many different systems; some systems will have better or worse memory characteristics, such as latency, than others. To deal with this variance between memory subsystems, the new prefetchers will change their behaviour and aggressiveness based on how the current system is behaving.

A thought of mine would be that this could signify some interesting performance improvements under some DVFS conditions – where the prefetchers will alter their behaviour based on the current memory frequency.

Another aspect of this new system-awareness is more knowledge of the cache pressure on the DSU’s L3 cache. If other CPU cores are highly active, the core’s prefetchers would see this and scale down their aggressiveness to avoid needlessly thrashing the shared cache, increasing overall system performance.
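As a rough illustration of the idea, a system-aware policy could be sketched as a function that scales how far ahead the prefetcher runs. Everything here is hypothetical: the `prefetch_degree` name, its parameters, the linear scaling and the assumption that slower memory calls for more aggressive prefetching are my own guesses, as Arm hasn't disclosed its actual heuristics. Only the backing off under L3 pressure follows directly from Arm's description.

```python
def prefetch_degree(
    base_degree: int,
    memory_latency_ns: float,
    reference_latency_ns: float,
    l3_pressure: float,
    max_degree: int = 8,
) -> int:
    """Return how many lines ahead to prefetch (hypothetical policy).

    l3_pressure is an assumed 0.0-1.0 estimate of how contended the shared L3 is.
    """
    if not 0.0 <= l3_pressure <= 1.0:
        raise ValueError("l3_pressure must be between 0.0 and 1.0")
    if memory_latency_ns <= 0 or reference_latency_ns <= 0:
        raise ValueError("latencies must be positive")

    # Assumption: slower memory means running further ahead to hide the latency.
    latency_scale = memory_latency_ns / reference_latency_ns
    # Busy shared cache: back off to avoid evicting other cores' data.
    pressure_scale = 1.0 - l3_pressure

    degree = round(base_degree * latency_scale * pressure_scale)
    # Clamp so the prefetcher never switches off fully nor runs away.
    return max(1, min(max_degree, degree))


if __name__ == "__main__":
    print(prefetch_degree(4, 100.0, 100.0, 0.0))   # baseline system, idle L3
    print(prefetch_degree(4, 150.0, 100.0, 0.0))   # slower memory, idle L3
    print(prefetch_degree(4, 100.0, 100.0, 0.75))  # baseline memory, busy L3
```

With these made-up inputs the degree goes from 4 at baseline to 6 with 50% slower memory, and drops to 1 when the shared L3 is heavily contended. A real implementation would derive its inputs from hardware counters rather than take them as parameters.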

108 Comments

View All Comments

saylick - Monday, May 27, 2019 - link
All this hype about the large cores... Where's the love for an improved A55 with better perf/W without considering process benefits?
Wilco1 - Monday, May 27, 2019 - link
Since the performance gap with the big cores widens so quickly, increasing performance of the little cores seems more important than further increasing perf/W. Note this year's 7+nm and next year's 5nm will help perf/W already.
Meteor2 - Monday, June 3, 2019 - link
The smaller cores sit in the background doing background stuff. Work/energy is the key metric for them, and A55 is the best. It's tough to improve; A76 and A77 only match A55 on that score, at best
eastcoast_pete - Monday, May 27, 2019 - link
While I get the focus of the article and the comments here on what this means for smartphones, I think this is even bigger for efforts by Qualcomm and Huawei to break into the ultraportable market. I think that's what the 3 GHz target frequency ARM mentioned is for. The A76-based large Snapdragon chip is already a promising alternative to Intel's low-power lineup, so the evolutionary step up of the A77 likely makes it even more attractive. As for Huawei, it'll depend on how much of the tech has already been transferred from ARM, and how badly China will want a "home grown" (of sorts) alternative to Intel.
pugster - Tuesday, May 28, 2019 - link
Unless there is some settlement in the trade talks where Huawei can work with ARM again, I don't think Huawei will release an SOC with an A77 in it. Since Huawei has an architecture license from ARM already they could release optimized ARM soc that could rival the A77.
GTan - Monday, May 27, 2019 - link
"Qualcomm has proclaimed a 45% leap in CPU performance compared to the previous generation Snapdragon 855 with Cortex-A75 cores, the biggest generational leap ever."

There is a typo. It's the Snapdragon 845 with the Cortex A-75 cores, not the Snapdragon 855.
NetMage - Monday, May 27, 2019 - link
Perhaps the typo is A-75 should be A-76?
Andrei Frumusanu - Tuesday, May 28, 2019 - link
Yes, corrected.
ksec - Monday, May 27, 2019 - link
So basically an A77 7nm SoC will be about as fast as a 10nm A10 from Apple.

Or in other words, if Apple discontinued the iPhone 7 this year and lowered the iPhone 8 price as their entry model, the iPhone 8 would be as fast as or faster than 99% of the Android phones on the market.
Wilco1 - Monday, May 27, 2019 - link
Cortex-A77 will match or beat A11. Cortex-A76 already scores around 3600 on GB4, so an extra 18% gives around 4300, right at the top end of A11.
