Arm's Cortex-A76 CPU Unveiled: Taking Aim at the Top for 7nm

Author: Andrei Frumusanu

by Andrei Frumusanu on May 31, 2018 3:01 PM EST

Cortex A76 µarch - Frontend

Starting off with a rough overview of the Cortex A76 microarchitectural diagram, we see the larger functional blocks. The A76 doesn’t look too different from other Arm processors in this regard, and the differences come only in the details that Arm is willing to divulge. To oversimplify, this is a superscalar out-of-order core with a 4-wide decode front-end and 8 execution ports in the back-end, with a total implementation pipeline depth of 13 stages but the execution latencies of an 11-stage core.

In the front-end, Arm has created a new predict/fetch unit that it calls a “predict-directed fetch”, meaning the branch prediction unit feeds into the instruction fetch unit. This is a divergence from past Arm µarches and it allows for both higher performance and lower power consumption.

The branch prediction unit is what Arm calls a first in the industry in adopting a hybrid indirect predictor. The predictor is decoupled from the fetch unit and its large supporting structures operate separately from the rest of the machine, which likely means it will be easier to clock-gate during operation to save power. The branch predictor is supported by 3-level branch target caches: a 16-entry nanoBTB, a 64-entry microBTB and a 6000-entry main BTB. Back in the A73 and A75 generations, Arm claimed its branch predictors were able to predict nearly all taken branches, so this new unit in the A76 seems to be one level above that in capability.

The branch unit operates at double the bandwidth of the fetch unit: it works on 32B/cycle, meaning up to 8 32b instructions per cycle. This feeds a fetch queue of 12 “blocks” in front of the instruction fetch. The fetch unit operates at 16B/cycle, meaning 4 32b instructions. Running at double the throughput lets the branch unit get ahead of the fetch unit, so that in the case of a mispredict it can hide branch bubbles in the pipeline and avoid stalling the fetch unit and the rest of the core. The core is said to be able to cope with up to 8 misses on the I-side.
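The bandwidth figures above reduce to simple arithmetic. The following is a minimal sketch, assuming fixed-width 32-bit instructions as the article does; the helper name is made up for illustration and is not Arm terminology.

```python
# Back-of-the-envelope throughput for the A76 predict and fetch paths.
# The byte widths come from the article; the function name is hypothetical.
INSTRUCTION_BYTES = 4  # a 32b instruction occupies 4 bytes


def instructions_per_cycle(bytes_per_cycle: int) -> int:
    """Number of whole 32-bit instructions covered by a given byte width."""
    return bytes_per_cycle // INSTRUCTION_BYTES


predict_ipc = instructions_per_cycle(32)  # branch unit: 32B/cycle
fetch_ipc = instructions_per_cycle(16)    # fetch unit: 16B/cycle

# How many instructions per cycle the predictor can pull ahead of fetch
# when both run at their peak rates.
run_ahead_per_cycle = predict_ipc - fetch_ipc

print(predict_ipc, fetch_ipc, run_ahead_per_cycle)
```

The difference between the two rates is the slack the predictor has for filling the fetch queue ahead of demand. Real workloads will not sustain either peak on every cycle.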

I mentioned at the beginning that the A76 is a 13-stage implementation with the latency of an 11-stage core. What happens is that in latency-critical paths the stages can be overlapped. One such overlap happens between the second cycle of the branch predict path and the first cycle of the fetch path. So effectively, while there are 4 (2+2) pipeline stages on the branch and fetch paths, the core has latencies as low as 3 cycles.
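The stage-overlap arithmetic can be written out as a small sketch. The stage counts are taken from the article; the function name is hypothetical, and the total of two overlapped cycles is only inferred from the 13-versus-11 figure, since just one overlap is described here.

```python
# Sketch of how overlapping pipeline stages shortens effective latency.
# Stage counts are from the article; the helper is purely illustrative.

def effective_latency(stage_counts, overlapped_cycles):
    """Sum the stages of each path, minus the cycles that run concurrently."""
    return sum(stage_counts) - overlapped_cycles


# Branch predict (2 stages) + fetch (2 stages), with the second predict
# cycle overlapping the first fetch cycle.
predict_fetch_latency = effective_latency([2, 2], overlapped_cycles=1)

# Whole core: 13 implementation stages behaving like an 11-stage pipeline
# implies two overlapped cycles in total (an inference, not a quoted figure).
implied_total_overlaps = 13 - 11

print(predict_fetch_latency, implied_total_overlaps)
```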

On the decode and rename stages we see a throughput of 4 instructions per cycle. The A73 and A75 were respectively 2-wide and 3-wide in their decode stages, so the A76 is 33% wider than the last generation in this aspect. It was curious to see the A73 go down in decode width from the 3-wide A72, but this was done to optimise for power efficiency and “leanness” of the pipeline, with the goal of improving the utilisation of the front-end units. With the A76 going 4-wide, this is also Arm’s widest microarchitecture to date, although it’s still extremely lean when compared with competing µarches from Samsung or Apple.

The fetch unit feeds a decode queue of up to 16 32b instructions. The pipeline stages here consist of 2 cycles of instruction align and decode. It looks like Arm decided to go back to a 2-cycle decode as opposed to the 1-cycle unit found on the A73 and A75. As a reminder, the Sophia cores still required a secondary cycle on the decode stage when handling instructions utilising the ASIMD/FP pipelines, so Arm may have found other optimisation methods with the A76 µarch that warranted this design decision.

The decode stage takes in 4 instructions per cycle and outputs macro-ops at an average ratio of 1.06 Mops per instruction. Entering the register rename stage we see heavy power optimisation, as the rename units are separated and clock-gated for integer/ASIMD/flag operations. Rename and dispatch form a 1-cycle stage, which is a reduction from the 2-cycle rename/dispatch of the A73 and A75. Macro-ops are expanded into micro-ops at a ratio of 1.2 µops per instruction, and we see up to 8 µops dispatched per cycle, which is an increase from the advertised 6 µops/cycle on the A75 and 4 µops/cycle on the A73.
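To put the expansion ratios into perspective, here is a rough calculation of what a fully fed 4-wide decoder would produce per cycle. It assumes the quoted average ratios hold at peak decode throughput, which is a simplification; the variable names are illustrative only.

```python
# Rough per-cycle op counts implied by the quoted average expansion ratios.
# Assumes the decoder sustains its full width, which real code rarely does.
DECODE_WIDTH = 4              # instructions decoded per cycle
MOPS_PER_INSTRUCTION = 1.06   # average macro-ops per instruction
UOPS_PER_INSTRUCTION = 1.2    # average micro-ops per instruction
DISPATCH_WIDTH = 8            # peak micro-ops dispatched per cycle

mops_per_cycle = DECODE_WIDTH * MOPS_PER_INSTRUCTION
uops_per_cycle = DECODE_WIDTH * UOPS_PER_INSTRUCTION

# Ratio of peak dispatch capacity to the average micro-op supply.
dispatch_headroom = DISPATCH_WIDTH / uops_per_cycle

print(f"{mops_per_cycle:.2f} Mops/cycle, {uops_per_cycle:.2f} uops/cycle, "
      f"{dispatch_headroom:.2f}x dispatch headroom")
```

Under these assumptions the 8-wide dispatch sits well above the average micro-op supply, which would leave room for bursts of instructions that expand into more micro-ops than average.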

The out-of-order window commit size of the A76 is 128 and the buffer is separated into two structures responsible for instruction management and register reclaim, called a hybrid commit system. Arm here made it clear that it wasn’t focusing on increasing this aspect of the design, as it found it to be a terrible return on investment when it comes to performance. It is said that the performance scaling is 1/7th, meaning a 7% increase of the reorder buffer only results in a 1% increase in performance. This stands in stark contrast to, for example, Samsung's M3 cores with a very large 224-entry ROB.
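The 1/7th rule of thumb is easy to play with numerically. The sketch below applies it linearly, which is an assumption on my part: Arm quoted the ratio without stating over what range it holds, so the extrapolation to a 224-entry buffer is illustrative and not a measured result.

```python
# Linear application of the quoted 1/7th ROB scaling rule of thumb.
# Extrapolating it over large size changes is an assumption, not Arm data.
ROB_SCALING = 1 / 7  # fractional performance gain per fractional ROB growth


def estimated_perf_gain(rob_old: int, rob_new: int) -> float:
    """Estimated fractional performance gain from growing the reorder buffer."""
    growth = (rob_new - rob_old) / rob_old
    return growth * ROB_SCALING


# The quoted case: 7% more ROB entries for about 1% more performance.
print(f"{estimated_perf_gain(100, 107):.1%}")

# Hypothetical: growing a 128-entry window to 224 entries.
print(f"{estimated_perf_gain(128, 224):.1%}")
```

Even taken at face value, a 75% larger window would be worth only about a tenth more performance by this rule, which fits Arm's stated reluctance to spend area there.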

As a last note on the front-end, Arm said it tried to optimise the front-end for the lowest possible latency for hypervisor activity and system calls, but didn’t go into more details.

123 Comments

tipoo - Thursday, May 31, 2018 - link
Still a 4-wide front end, I don't imagine it'll catch A10, maybe A9 per core then eh.
wicketr - Thursday, May 31, 2018 - link
I just don't understand why ARM doesn't at least come out with a design that can match the Monsoon cores of an A11, or even the power of what will likely be the next A12 cores. It seems like ARM is eternally 2-3 steps behind Apple on this and they need to catch up.
shadowx360 - Thursday, May 31, 2018 - link
Probably their power/efficiency constraints. They manage to get the same performance as a M3 core with a 4 wide instead of 6 wide decoder and half the power usage. The A11 cores are absolute monsters at power draw at max performance but Apple is able to tweak the hell out of the rest of the device and OS to get the battery life in check. Android OEMs don't have that much control.
wicketr - Thursday, May 31, 2018 - link
And I could understand the power issues for phones, but not all ARM chips are destined for phones. Some can go into cars or gaming consoles that are always plugged in and well ventilated.

I just think they should come out with another tier ( Cortex A9X series) that can go toe-to-toe with Apple's best even if it is too power hungry for phones. Just come up with a design and see where we're at.
Wilco1 - Thursday, May 31, 2018 - link
Using a much larger core to get modest extra performance wouldn't make sense even in less power constrained cases. Not every market is happy with just 2 huge cores, so power and area efficiency remain important. For laptops binning for frequency and adding turbo modes would make far more sense.
BillBear - Friday, June 1, 2018 - link
>Using a much larger core to get modest extra performance wouldn't make sense even in less power constrained cases.

It makes perfect sense if you don't care that your core is large, because you aren't just selling a SOC. For Qualcomm, increased die size means reduced profit. For Apple, it does not.

For instance, Apple's Cyclone core from 2013:

>With six decoders and nine ports to execution units, Cyclone is big. As I mentioned before, it's bigger than anything else that goes in a phone. Apple didn't build a Krait/Silvermont competitor, it built something much closer to Intel's big cores. At the launch of the iPhone 5s, Apple referred to the A7 as being "desktop class" - it turns out that wasn't an exaggeration.

https://www.anandtech.com/show/7910/apples-cyclone...
Matthmaroo - Monday, June 4, 2018 - link
Apple has so many built in advantages - huge R&D, excellent engineering, closed system ... android manufacturers are disadvantaged to Apple in so many ways
close - Tuesday, June 5, 2018 - link
ARM has to build a "one size fits all" kind of solution. Unlike Apple they are not catering for a single customer with full control over every aspect of HW and SW development, and the profits associated with that.

Plus, achieving the power that the Apple cores bring doesn't come cheap. Samsung's Exynos is still lagging behind and it's not like Samsung doesn't have expertise or deep pockets.
techconc - Tuesday, June 5, 2018 - link
Yeah, but when you have a big little architecture, OEMs could choose the most efficient combination to meet their needs. There needs to be a powerful single core option that's available for the ARM platform. Until ARM goes there, the rest of the ARM community will be behind Apple. Remember, not all workloads can take advantage of multiple cores. At best ARM will be approaching 2016 level Apple A series core performance.
bananaforscale - Saturday, June 9, 2018 - link
Excellent engineering? Like the bendgate, touch screen problems etc. that were *engineering screwups*?