Arm Announces New Mali-G710, G610, G510 & G310 Mobile GPU Families
by Andrei Frumusanu on May 25, 2021 10:00 AM EST
The Mali-G710: Doubling up per-core performance
As a continuation of the Valhall GPU architecture, the new G710’s execution engines retain roughly the same cornerstone characteristics that we covered in the past-generation Mali-G77 and Mali-G78.
Amongst the larger changes we saw with Valhall was the shift from a wavefront/warp size of 8 towards 16, with dual datapaths (clusters) per execution engine, resulting in a 32 FMA/core design that we saw in the G77 and G78.
The ISA is said to have seen larger improvements and to have been designed with modern APIs such as Vulkan in mind, although it’s always quite hard to quantify the impact such changes have on the overall performance and efficiency of a GPU.
What’s new in the Mali-G710 is the addition of a second execution engine, effectively doubling the compute performance per shader core of the Valhall architecture. In a sense, Arm is re-adopting a scaling approach that we had seen in past-generation Mali architectures; the Mali-G76, for example, had three execution engines per shader core.
In the above slide, the “8x” and “4x” metrics refer to throughput per cycle per core, and they show that other functional blocks of the GPU have also doubled in throughput to keep up with the doubled compute throughput of the execution engines.
The new G710 includes a brand-new texture unit that is now able to handle up to 8 bilinear texels per clock, and Arm has generally optimised the new design to be significantly more area efficient, giving the new TMU a +50% performance density advantage.
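To put the 8 bilinear texels per clock figure into perspective, peak texture throughput simply scales with core count and clock frequency. The following is a minimal Python sketch of that arithmetic; the 10-core configuration and the 850MHz clock are hypothetical values chosen for illustration and are not figures from Arm.

```python
def bilinear_texel_rate(texels_per_clock_per_core: int, cores: int, clock_hz: float) -> float:
    """Theoretical peak bilinear texel rate in texels per second."""
    return texels_per_clock_per_core * cores * clock_hz


# From the article: the G710 texture unit handles up to 8 bilinear texels per clock.
G710_TEXELS_PER_CLOCK = 8

# Hypothetical configuration, for illustration only (not from Arm).
example_cores = 10
example_clock_hz = 850e6

peak_rate = bilinear_texel_rate(G710_TEXELS_PER_CLOCK, example_cores, example_clock_hz)
print(f"{peak_rate / 1e9:.1f} Gtexels/s")
```

This is a theoretical ceiling only; real-world throughput depends on memory bandwidth, caching and the texture formats in use.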
Within the execution engine, Arm continues to employ two processing units, or clusters of processing elements, and in that regard we don’t see much difference between the generations. However, if we look deeper into the actual processing unit, there are changes to the blocks:
In the simplest terms, what we’re seeing is a shift from a single instance of 16-wide (warp-wide) processing elements and execution units to four instances of 4-wide execution units. The throughput between the designs doesn’t change, but the new microarchitecture gives more dedicated resources to the processing elements and allows for better structuring and thus better efficiency.
Overall, the new execution engine design doubles the FMAs per clock per core, which is somewhat obvious, but it also has the benefit of lowering the energy consumption attributable to the execution engines within the shader core by 20%.
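The per-core FMA figures follow directly from the counts described above, and can be sketched in a few lines of Python. The class and field names here are purely illustrative; the 64 FMA/clock result for the G710 is derived from the stated doubling over the 32 FMA/core design of the G77 and G78.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShaderCore:
    name: str
    engines: int           # execution engines per shader core
    units_per_engine: int  # processing units (clusters) per execution engine
    instances: int         # execution-unit instances per processing unit
    lanes: int             # width of each execution-unit instance

    @property
    def fma_per_clock(self) -> int:
        """FMA operations issued per clock by one shader core."""
        return self.engines * self.units_per_engine * self.instances * self.lanes


# One engine, two 16-wide clusters: the 32 FMA/core design of the G77 and G78.
g78 = ShaderCore("Mali-G77/G78", engines=1, units_per_engine=2, instances=1, lanes=16)

# Two engines, each with two clusters of four 4-wide instances.
g710 = ShaderCore("Mali-G710", engines=2, units_per_engine=2, instances=4, lanes=4)

for core in (g78, g710):
    print(f"{core.name}: {core.fma_per_clock} FMA/clock/core")
```

Note that moving from one 16-wide instance to four 4-wide instances leaves the per-cluster width at 16; the doubling comes entirely from the second execution engine.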
A further very large highlight of the G710 is the replacement of the traditional “Job Manager” with the new “Command Stream Frontend” (CSF), which handles the scheduling and processing of draw calls. The CSF introduces a new CPU of undisclosed nature, and for the first time also introduces a firmware layer to Mali GPUs.
The goals of the design are to achieve more flexible and scalable performance for more complex graphical workloads, while at the same time improving system CPU power efficiency by reducing driver overhead through a very lightweight submission path. It simplifies support for API features such as state inheritance and secondary buffers, and helps with timing-sensitive applications such as VR or time-warp applications. Synchronisation events also greatly benefit from the move closer to the hardware and the reduction of latency that this enables.
The firmware is closely coupled to the hardware. It handles requests from the host as well as command buffer completion notifications, reduces the overhead of things such as protected entry and exit, and even allows for emulation of API features that don’t yet exist in the hardware through additional instructions.
The new hardware has been redesigned from the ground up to keep up with modern content and to sustain the throughput of job submission into the other GPU units. Arm claims that the new CSF allows for up to 5 million draw calls per second.
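To give the 5 million draw calls per second claim some context, it helps to convert it into a per-frame budget. The following is a rough Python sketch of that conversion; the list of frame rates is an arbitrary choice for illustration, and the result is only a ceiling on submission rate, not a statement about what the rest of the GPU can actually render.

```python
# Arm's claimed upper bound for the Command Stream Frontend.
CSF_DRAW_CALLS_PER_SECOND = 5_000_000


def draw_calls_per_frame(fps: int, per_second: int = CSF_DRAW_CALLS_PER_SECOND) -> int:
    """Whole number of draw calls that fit into one frame at the given frame rate."""
    return per_second // fps


# Frame rates picked for illustration only.
for fps in (30, 60, 90, 120):
    print(f"{fps:>3} fps: {draw_calls_per_frame(fps):,} draw calls per frame")
```

At 60fps, the claimed ceiling works out to roughly 83,000 draw calls per frame.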
Overall, the new G710 microarchitecture seems very interesting, and in particular seems to want to address some of the API-overhead-related weaknesses of Arm’s Mali GPUs. How this plays out remains to be seen, but with advertised performance and power efficiency gains of 20% this generation, it seems like a solid improvement, although these figures wouldn’t be quite sufficient to alter the competitive landscape in the mobile market.
The Mali-G610 uses the same microarchitecture as the G710, only under a different name for core configurations lower than 7 cores.