Today, Apple has unveiled their brand-new MacBook line-up. This isn't an ordinary release; the move Apple is making today is something that hasn't happened in 15 years: the start of a CPU architecture transition across their whole consumer Mac line-up.

Thanks to the company's vertical integration across hardware and software, this is a monumental change that nobody but Apple can so swiftly usher in. The last time Apple ventured into such an undertaking, in 2006, the company ditched IBM's PowerPC ISA and processors in favor of Intel x86 designs. Today, Intel is being ditched in favor of the company's own in-house processors and CPU microarchitectures, built upon the Arm ISA.

The new processor is called the Apple M1, the company's first SoC designed with Macs in mind. With four large performance cores, four efficiency cores, and an 8-core GPU, it features 16 billion transistors on a 5nm process node. Apple is starting a new SoC naming scheme for this new family of processors, but at least on paper it looks a lot like an A14X.

Today's event contained a ton of new official announcements, but was also (in typical Apple fashion) light on detail. In this article, we'll be dissecting the new Apple M1 news, as well as doing a microarchitectural deep dive based on the already-released Apple A14 SoC.

The Apple M1 SoC: An A14X for Macs

The new Apple M1 is really the start of a major new journey for Apple. During Apple's presentation the company didn't divulge much in the way of details for the design; however, there was one slide that told us a lot about the chip's packaging and architecture:

This packaging style with DRAM embedded within the organic packaging isn't new for Apple; they've been using it since the A12. However it's something that's only sparingly used. When it comes to higher-end chips, Apple likes to use this kind of packaging instead of your usual smartphone POP (package on package) because these chips are designed with higher TDPs in mind. So keeping the DRAM off to the side of the compute die rather than on top of it helps to ensure that these chips can still be efficiently cooled.

What this also means is that we’re almost certainly looking at a 128-bit DRAM bus on the new chip, much like that of previous generation A-X chips.

On the very same slide, Apple also seems to have used an actual die shot of the new M1 chip. It perfectly matches Apple's described characteristics of the chip, and it looks like a real photograph of the die. Cue what's probably the quickest die annotation I've ever made:

We can see the M1's four Firestorm high-performance CPU cores on the left side. Notice the large amount of cache – the 12MB cache was one of the surprise reveals of the event, as the A14 still only featured 8MB of L2 cache. The new cache here looks to be portioned into 3 larger blocks, which makes sense given Apple's transition from 8MB to 12MB for this new configuration; it is, after all, now being used by 4 cores instead of 2.

Meanwhile the 4 Icestorm efficiency cores are found near the center of the SoC, above which we find the SoC’s system level cache, which is shared across all IP blocks.

Finally, the 8-core GPU takes up a significant amount of die space and is found in the upper part of this die shot.

What's most interesting about the M1 here is how it compares to other CPU designs by Intel and AMD. All the aforementioned blocks still only cover part of the whole die, with a significant amount of auxiliary IP alongside them. Apple mentioned that the M1 is a true SoC, including the functionality of what were previously several discrete chips inside of Mac laptops, such as I/O controllers and Apple's SSD and security controllers.

The new CPU core is what Apple claims to be the world's fastest. This is going to be a centre-point of today's article as we dive deeper into the microarchitecture of the Firestorm cores, as well as look at the performance figures of the very similar Apple A14 SoC.

With its additional cache, we expect the Firestorm cores used in the M1 to be even faster than what we’re going to be dissecting today with the A14, so Apple’s claim of having the fastest CPU core in the world seems extremely plausible.

The whole SoC features a massive 16 billion transistors, which is 35% more than the A14 inside of the newest iPhones. If Apple was able to keep the transistor density between the two chips similar, we should expect a die size of around 120mm². This would be considerably smaller than past generations of Intel chips inside of Apple's MacBooks.
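
For those wanting to check the math, the estimate works out as follows. This is a back-of-the-envelope sketch; the A14 inputs are assumptions on our part, namely Apple's stated ~11.8 billion transistor count and a die area of roughly 88mm² from third-party die-shot measurements:

    # Back-of-the-envelope M1 die-size estimate, assuming the M1 keeps the
    # A14's transistor density. The A14 inputs are assumptions: ~11.8B
    # transistors (Apple's stated figure) on a die of roughly 88 mm²
    # (third-party die-shot measurements).
    a14_transistors = 11.8e9
    a14_die_area_mm2 = 88.0

    m1_transistors = 16e9  # Apple's stated figure for the M1

    density_per_mm2 = a14_transistors / a14_die_area_mm2
    m1_die_area_mm2 = m1_transistors / density_per_mm2

    print(f"A14 density: {density_per_mm2 / 1e6:.0f} MTr per mm2")   # ~134
    print(f"Estimated M1 die area: {m1_die_area_mm2:.0f} mm2")       # ~119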

Road To Arm: Second Verse, Same As The First

Section by Ryan Smith

The fact that Apple can even pull off a major architectural transition so seamlessly is a small miracle, and one that Apple has quite a bit of experience in accomplishing. After all, this is not Apple's first time switching CPU architectures for their Mac computers.

The long-time PowerPC company came to a crossroads around the middle of the 2000s when the Apple-IBM-Motorola (AIM) alliance, responsible for PowerPC development, increasingly struggled with further chip development. IBM’s PowerPC 970 (G5) chip put up respectable performance numbers in desktops, but its power consumption was significant. This left the chip non-viable for use in the growing laptop segment, where Apple was still using Motorola’s PowerPC 7400 series (G4) chips, which did have better power consumption, but not the performance needed to rival what Intel would eventually achieve with its Core series of processors.

And thus, Apple played a card that they held in reserve: Project Marklar. Leveraging the flexibility of Mac OS X and its underlying Darwin kernel, which like other Unixes is designed to be portable, Apple had been maintaining an x86 version of Mac OS X. Though largely considered to initially have been an exercise in good coding practices – making sure Apple was writing OS code that wasn't unnecessarily bound to PowerPC and its big-endian memory model – Marklar became Apple's exit strategy from a stagnating PowerPC ecosystem. The company would switch to x86 processors – specifically, Intel's x86 processors – upending its software ecosystem, but also opening the door to much better performance and new customer opportunities.

The switch to x86 was by all metrics a big win for Apple. Intel's processors delivered better performance-per-watt than the PowerPC processors that Apple left behind, and especially once Intel launched the Core 2 (Conroe) series of processors in late 2006, Intel firmly established itself as the dominant force for PC processors. This ultimately set up Apple's trajectory over the coming years, allowing them to become a laptop-focused company with proto-ultrabooks (MacBook Air) and their incredibly popular MacBook Pros. Similarly, x86 brought with it Windows compatibility, introducing the ability to directly boot Windows, or alternatively run it in a very low overhead virtual machine.

The cost of this transition, however, came on the software side of matters. Developers would need to start using Apple’s newest toolchains to produce universal binaries that could work on PPC and x86 Macs – and not all of Apple’s previous APIs would make the jump to x86. Developers of course made the jump, but it was a transition without a true precedent.

Bridging the gap, at least for a bit, was Rosetta, Apple's PowerPC translation layer for x86. Rosetta would allow most PPC Mac OS X applications to run on the x86 Macs, and though performance was a bit hit-and-miss (PPC on x86 isn't the easiest thing), the higher performance of the Intel CPUs helped to carry things for most non-intensive applications. Ultimately Rosetta was a band-aid for Apple, and one Apple ripped off relatively quickly; Apple had already dropped Rosetta by the time of Mac OS X 10.7 (Lion) in 2011. So even with Rosetta, Apple made it clear to developers that they expected them to update their applications for x86 if they wanted to keep selling them and to keep users happy.

Ultimately, the PowerPC to x86 transition set the tone for the modern, agile Apple. Since then, Apple has created a whole development philosophy around going fast and changing things as they see fit, with only limited regard to backwards compatibility. This has given users and developers few options but to enjoy the ride and keep up with Apple's development trends. But it has also given Apple the ability to introduce new technologies early, and if necessary, break old applications so that new features aren't held back by backwards compatibility woes.

All of this has happened before, and it will all happen again starting next week, when Apple launches their first Apple M1-based Macs. Universal binaries are back, Rosetta is back, and Apple’s push to developers to get their applications up and running on Arm is in full force. The PPC to x86 transition created the template for Apple for an ISA change, and following that successful transition, they are going to do it all over again over the next few years as Apple becomes their own chip supplier.

A Microarchitectural Deep Dive & Benchmarks

On the following page we'll be investigating the A14's Firestorm cores, which will also be used in the M1, and doing some extensive benchmarking on the iPhone chip, setting the stage for the minimum of what to expect from the M1:

Apple's Humongous CPU Microarchitecture

644 Comments

  • mdriftmeyer - Thursday, November 12, 2020 - link

    Logic Pro X
  • vais - Thursday, November 12, 2020 - link

    Great article until it reached the benchmark against x86 part.
    I am amazed how something can claim to be a benchmark and yet leave out what is being measured, what are the criteria, are the results adjusted for power, etc.

    Here are some quotes from the article and why they seem to be biased towards Apple, bordering on fanboyism:

    "x86 CPUs today still only feature a 4-wide decoder designs (Intel is 1+4) that is seemingly limited from going wider at this point in time due to the ISA’s inherent variable instruction length nature, making designing decoders that are able to deal with aspect of the architecture more difficult compared to the ARM ISA’s fixed-length instructions"

    And who ever said wider is always better, especially in two different instruction sets? Comparing apples to melons here...

    "On the ARM side of things, Samsung’s designs had been 6-wide from the M3 onwards, whilst Arm’s own Cortex cores had been steadily going wider with each generation, currently 4-wide in currently available silicon"

    Based on that alone would you conclude Exynos is some miracle of CPU design and it somehow comes anywhere close to the performance of a full blown desktop enthusiast grade CPU? Sure hope not.

    "outstanding lode/store:
    To not surprise, this is also again deeper than any other microarchitecture on the market. Interesting comparisons are AMD’s Zen3 at 44/64 loads & stores, and Intel’s Sunny Cove at 128/72. "

    Again comparing different things and drawing conclusions like it's a linear scale. AMD's load/stores are significantly less than Intel's and yes AMD Zen3 CPUs outperform Intel counterparts across the board. I'd say biased as hell...

    "AMD also wouldn’t be looking good if not for the recently released Zen3 design."
    So comparing a yet-unreleased core to the latest already available from the competition, and somehow the competition is in a bad place because "only" its latest product is better? Come on...

    "The fact that Apple is able to achieve this in a total device power consumption of 5W including the SoC, DRAM, and regulators, versus +21W (1185G7) and 49W (5950X) package power figures, without DRAM or regulation, is absolutely mind-blowing."

    I am really interested in where those package power figures come from, specifically for the 5950X. AMD's site lists it as 105W TDP. How were the 49W measured?

    I've read other articles from Andrei which have been technical, detailed and specific marvels, but this one misses the mark by a long shot in the benchmarks and conclusion parts.
  • Bluetooth - Thursday, November 12, 2020 - link

    They don't have an actual M1 to test, as they say in the article. The M1 will be available on the 24th.
  • GeoffreyA - Thursday, November 12, 2020 - link

    I think it would be instructive to remember the Pentium 4, which had a lot of "fast" terms for its time: hyper-pipelined this, double pumped ALUs, quad pumped that; but we all know the result. The proof of the pudding is in the eating, or in the field of CPUs, performance, power, and silicon area.

    AMD and Intel have settled down to 4- and 5-wide decode as the best trade-offs for their designs. They could make it 8-wide tomorrow, but it's likely no use, and would cause disaster from a power point of view.* If Apple wishes to go for wide, good for them, but the CPU will be judged not on "I've got this and that," but on its final merits.

    Personally, I think it's better engineering to produce a good result with fewer ingredients. Compare Zen 3's somewhat conservative out-of-order structures to Sunny Cove's, yet it still comes out ahead.

    When the M1 is on an equal benchmark field with 5 nm x86, then we'll see whether it's got the goods or not.

    * Decoding takes up a lot of power in x86, that's why the micro-op cache is so effective (removing fetch and pre/decode). In x86, decoding can't be done in parallel, owing to the varying instruction lengths: one has to determine first how long one instruction is before knowing where the next one starts, whereas in fixed-length ISAs, like ARM, it can be done in parallel: length being fixed, we know where each instruction starts.
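
    To make that footnote concrete, here is a toy sketch of the difference (illustrative Python only, not how real hardware works; length_of is a hypothetical stand-in for real instruction-length decoding):

    # Toy illustration: finding instruction start offsets in a byte stream.
    # Fixed-length ISA: every start is known up front, so all instructions
    # can be decoded in parallel. Variable-length ISA: each start depends
    # on the previous instruction's length.

    def fixed_length_starts(code: bytes, width: int = 4) -> list[int]:
        # Instruction i begins at i * width; there is no dependency between
        # entries, so a decoder can attack all of them at once.
        return list(range(0, len(code), width))

    def variable_length_starts(code: bytes, length_of) -> list[int]:
        # length_of(code, offset) is a hypothetical helper returning the
        # length of the instruction starting at `offset`. Each iteration
        # depends on the previous one -- the serial bottleneck.
        starts, offset = [], 0
        while offset < len(code):
            starts.append(offset)
            offset += length_of(code, offset)
        return starts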
  • Joe Guide - Thursday, November 12, 2020 - link

    The benchmarks are coming out, and it looks like the pudding is quite tasty. But you have a good point. When in 2025 or 2026 Intel or AMD releases their newest 5 nm x86, you will be proven to be prophetic that the new Intel chip resoundingly beats the base M1 chip from 5 years ago.
  • GeoffreyA - Thursday, November 12, 2020 - link

    That line about the M1 and 5 nm is silly on my part, I'll admit. Sometimes we write things and regret it later. Also, if you look at my comment from the other day, you'll see the first thing I did was acknowledge Apple's impressive work on this CPU. The part about the Pentium 4 and the pudding wasn't in response to the A14's performance, but this whole debate running through the comments about wide vs. narrow, and so I meant, "Wide, narrow, doesn't mean anything. What matters is the final performance."

    I think what I've been trying to say, quite feebly, through the comments is: "Yes, the A14 has excellent performance/watt, and I'm shocked that 5W can go up against a 105W Ryzen. But, fanboy comment it may be, I'm confident AMD and Intel (or AMD at any rate) can improve their stuff and beat Apple."
  • Joe Guide - Thursday, November 12, 2020 - link

    I see this as glass half full. There has been far too much complacency in CPU development over the last decade. If it takes Apple to kick the industry in the butt, well then, how is that bad?

    Moore's Law has awoken after a deep slumber, and it is hungry and angry. Run, Intel. Run for your life.
  • GeoffreyA - Friday, November 13, 2020 - link

    Agreed, when AMD was struggling, Intel's improvements were quite meagre (Sandy Bridge excepted). Much credit must be given to AMD though. Their execution of the past few years has been brilliant.
  • chlamchowder - Friday, November 13, 2020 - link

    In x86, decoding is very much done in parallel. That's how you get 3/4/5-wide decoders. The brute force method is to tentatively start decoding at every byte. Alternatively, you mark instruction boundaries in the instruction cache (Goldmont/Tremont do this, as well as older AMD CPUs like Phenom).
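
    A rough sketch of the boundary-marking idea in principle (a simplification, not how any particular CPU implements it; length_of is again a hypothetical stand-in for real length decoding): the serial length-finding happens once at cache-line fill, and the fetch path just reads the stored bits.

    # Simplified sketch of predecode / boundary marking in an instruction
    # cache. The serial walk over variable-length instructions happens once,
    # when the line is filled; afterwards the stored boundary bits let the
    # front-end pick out several instruction starts at once without
    # re-measuring lengths.

    def predecode_line(line: bytes, length_of) -> list[bool]:
        # boundary[i] is True if an instruction starts at byte i of the line.
        boundary = [False] * len(line)
        offset = 0
        while offset < len(line):
            boundary[offset] = True
            offset += length_of(line, offset)  # serial, but only at fill time
        return boundary

    def fetch_group(boundary: list[bool], start: int, width: int) -> list[int]:
        # On the fetch path, scan the precomputed bits and hand up to `width`
        # instruction start offsets to the decoders in one go.
        return [i for i in range(start, len(boundary)) if boundary[i]][:width]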
  • GeoffreyA - Saturday, November 14, 2020 - link

    Thanks for that. I'm only a layman in all this, so I don't know the exact details. I did suspect there was some sort of trick going on to decode more than one at a time. Marking instruction boundaries in the cache is quite interesting because it ought to tone down, or even eliminate, x86's variable-length troubles. Didn't know about Tremont and Goldmont, but I was reading that the Pentium MMX, as well as K8 to Bulldozer, perhaps K7 too, used this trick.

    My question is, do you think AMD and Intel could re-introduce it (while keeping the micro-op cache as well)? Is it effective or does it take too much effort itself? I ask because if it's worth it, it could help x86's length problem quite a bit, and that's something which excites me, under this current climate of ARM. However, judging from the results, it didn't aid the Athlon, Phenom, and Bulldozer that drastically, and AMD abandoned it in Zen, going for a micro-op cache instead, so that knocks down my hopes a bit.
