Answered by the Experts: ARM's Cortex A53 Lead Architect, Peter Greenhalghby Anand Lal Shimpi on December 17, 2013 11:56 PM EST
- Posted in
- Ask the Experts
Last week we held an awesome Ask the Experts Q&A with ARM's Peter Greenhalgh, lead architect for the Cortex A53. Peter did a great job answering questions in the comments, but for those of you who missed them we're compiling all of the Q&A here for you to go through.
On Friday, December 20th at 12:00PM ET, Peter will be joining me for a live Google Hangouts chat. We'll be posting more details on that later this week. For now, enjoy the responses!
Question from shodanshok
Hi, it would be interesting to know two thing:
- the cache memories (L1/L2) are write-back or write-through? Inclusive or exclusive?
- multiprocessor capabilities are limited to 4 cores or they can scale to 8+ cores without additional glue logic?
All cacheable attributes are supported, but Cortex-A53 is optimised around write-back, write-allocate. The L2 cache is inclusive on the instruction side and exclusive on the data side.
A Cortex-A53 cluster only supports up to 4-cores. If more than 4-cores are required in a platform then multiple clusters can be implemented and coherently connected using an interconnect such as CCI-400. The reason for not scaling to 8-cores per cluster is that the L2 micro-architecture would need to either compromise energy-efficiency in the 1-4 core range to achieve performance in the 4-8 core range, or compromise performance in the 4-8 core range to maximise energy-efficiency in the 1-4 core range. This isn’t a hard and fast rule for all clusters, but is the case for a cluster at the Cortex-A53 power/performance point. For the majority of mobile use cases it is best to focus on energy efficiency and enable more than 4-cores through multi-cluster solutions.
Question from lukarak
We have seen MediaTek introducing an 8xA7 SOC, instead of going to the big.LITTLE configuration of some sorts. Do you expect the same thing to happen with the A53 and A57 generation for low budget SOCs or will this generation's combo be a little easier and cheaper to implement?
We expect to see a range of platform configurations using Cortex-A53. A 4+4 Cortex-A53 platform configuration is fully supported and a logical progression from a 4+4 Cortex-A7 platform. A Cortex-A57 in the volume smartphone markets is less likely, but that’s a decision in the hands of the ARM partners. It will be interesting to see the range of Cortex-A53 platforms and configurations announced by partners over the coming months.
Question from ehsan.nitol
Hi there, I have some questions.
We have already seen how well Qualcomm's Cortex A7 can perform thanks to Moto G. How much will it improve with the new Cortex A53? What will be the core and performance wise difference? How will you compare it against Cortex A9, A12 and A15 in terms of performance, battery consumption and all.
With the Exynos Octa core processor Battery Test we haven't seen much battery improvements compared to Qualcomm's Snapdragon 600 and 800 Processor. How will it perform this time?
What is ARM planning do with its Mali GPU? What will be next after Cortex A53 and A57?
Cortex-A53 has the same pipeline length as Cortex-A7 so I would expect to see similar frequencies when implemented on the same process geometry. Within the same pipeline length the design team focussed on increasing dual-issue, in-order performance as far as we possibly could. This involved symmetric dual-issue of most of the instruction set, more forwarding paths in the datapaths, reduced issue latency, larger & more associative TLB, vastly increased conditional and indirect branch prediction resources and expanded instruction and data prefetching. The result of all these changes is an increase in SPECInt-2000 performance from 0.35-SPEC/Mhz on Cortex-A7 to 0.50-SPEC/Mhz on Cortex-A53. This should provide a noticeable performance uplift on the next generation of smartphones using Cortex-A53.
Question from Techguy99X
Why are the current A7 quad core phones performing similar to the A9 quad (exynos 4412 , tegra 3), although A9 is more advanced and OoO? What is the main difference between A5 and A7, becuase the A7 is just a bit faster than the A5?
Overall platform performance is dependent on many factors including processor, interconnect, memory controller, GPU, video and more. While the Cortex-A9 is a higher performance processor both in IPC and frequency, ARM partners are continuously improving their platforms and porting them to new process geometries. This allows a new generation Cortex-A7 based platform to improve on an older generation Cortex-A9 based platform.
Compared to Cortex-A5, Cortex-A7 increased load-store bandwidth, allowed more common data-processing operations to dual-issue and made some small improvements in the branch-predictors.
Question from barleyguy
When designing a processor and deciding on which performance attributes to emphasize, do you target current workloads for short term market concerns, or do you target possible future workloads for the market a year or two from now? Or is performance tuning more workload agnostic, and do you say "I want this to be fast for everything"?
For example, since ARM processors are very popular in the Android market, do you tune for content comsumption and gaming? Or since Android may be trending towards more of a primary computing device in the future, is it important to tune for desktop applications?
And finally, what are the considerations of performance tuning for thermally constrained devices?
Question from psychobriggsy
Do you have any plans to support various forms of turbo functionality within your next generation ARM cores? An a potential example, in a 28nm quad-core A53 setup at 1.2GHz, you could support dual-core at >1.4GHz and single core at >1.6GHz within the same power consumption (core design allowing, of course), yet single threaded performance could improve significantly.
ARM cores have been historically low power, however that doesn't mean there aren't more power savings to be made. Examples include deeper sleep states, power gated cores, and so on - features that Intel and AMD have had to include in order to reduce their TDPs whereas ARM cores haven't need them (yet). What are the future power saving methods that ARM is considering for its future cores (that you can give away)?
A Turbo mode is typically a form of voltage overdrive for brief periods of time to maximise performance, which ARM partners have been implementing on mobile platforms for many years. Whether this is applied to 1,2 or more cores is a decision of the Operating System and the platform power management software. If there is only one dominant thread you can bet that mobile platforms will be using Turbo mode. Due to the power-efficiency of Cortex-A53 on a 28nm platform, all 4 cores can comfortably be executing at 1.4GHz in less than 750mW which is easily sustainable in a current smartphone platform even while the GPU is in operation.
In terms of further power saving techniques, power gating unused cores is a technique that has been used since the first Cortex-A9 platforms appeared on the market several years ago. The technique is so fundamental that I think many in the mobile industry use it automatically and forget that it’s a highly beneficial power saving technique. But you are correct that there is more milage to come from deeper sleep states which is why both Cortex-A53 and Cortex-A57 support state retention techniques in caches and the NEON unit to further improve leakage power.
Question from Xebec
Peter, thanks for offering your time to Anandtech!
I was curious if you could talk a bit about how easy/difficult A53-derived SoCs might be to integrate into solutions that are already using A7/A9 type chips? i.e. Devices like Beagleboards, Raspberry Pis, ODROIDs, etc. Is there anything that makes the A53 particularly difficult or easy to suit to these types of devices?
Also, for Micro and "regular" servers, do you see A57/A53 big.LITTLE being the norm, or do you anticipate a variety of A53-only and A57-only designs? Any predictions on market split between the A5x series here?
Cortex-A53 has been designed to be able to easily replace Cortex-A7. For example, Cortex-A7 supports the same bus-interface standards (and widths) as Cortex-A7 which allows a partner who has already built a Cortex-A7 platform to rapidly convert to Cortex-A53.
With servers I think we will see a mix of solutions. The most popular approach will be to use Cortex-A57 due to the performance that micro-architecture is capable of providing, but I still expect some Cortex-A53 servers and big.LITTLE too!
Question from Doormat
ARM CPU vendors (Qualcomm, Nvidia, etc) seem to be choosing slower quad core over faster dual core, and I'm suspecting its all a marketing game (e.g. more cores is better, see Motorola's X8 announcement of an "8 core" phone). Do those non-technical decisions impact the decisions of the engineers in developing the ARM architecture?
You are quite correct that there are a variety of frequencies and core-counts being offered by ARM partners. However, for ARM design micro-architectures these do not have an effect on micro-architectures as we must be able to support a variety of target frequencies and core-counts across many different process geometries.
Question from Factory Factory
How does designing a CPU "by hand" differ from using an automated layout tool? What sort of trade-offs does/would using automated tools cause for ARM's cores?
Second question: With many chips from many manufacturers now implementing technologies like fine-grained power gating, extremely fine control of power and clock states, and efficient out-of-order execution pipelines, where does ARM go from here to keep its leadership in low-power compute IP?
Hand layout versus automated layout is an interesting trade-off. From one perspective, full hand-layout for all circuits in a processor is rarely used now. Aside from cache RAMs which are always custom, hand-layout is reserved for datapath and queues which are regular structures that allow a human to spot the regularity and ‘beat’ an automated approach. However, control logic is not amenable to hand-layout as it’s very difficult to beat automated tools which means that the control logic can end up setting the frequency of the processor without significant effort.
In general the benefit from hand-layout has been reducing in recent years. Partly this is due to the complexity of the design rules for advanced process generations reducing the scope for more specific circuit tuning techniques to be used. But another factor is the development of advanced standard cell libraries that have a large variety of cells and drive strengths which lessens the need for special circuit techniques. When we’re developing our processors we’re fortunate to have access to a large team in ARM designing standard cell libraries and RAMs who can advise us about upcoming nodes (for example 16nm and 10nm). In turn the processor teams can suggest & trial new advanced cells for the libraries which we call POPs (Processor Optimization Packages) that improve frequency, power and area.
A final trade-off to consider is process portability. After an ARM processor is licensed we see it on many different process geometries which is only possible because the designs are fully synthesizable. For example, there are Cortex-A7 implementations on all the major foundries from 65nm to 16nm. In combination with the advanced standard cell libraries for these processes there is little need to go to a hand-layout approach and we instead enable our partners to get to market more rapidly on the process geometry and foundry of their choosing.