The Mali-G710: Doubling up per-core performance

As a continuation of the Valhall GPU architecture, the new G710’s execution engines share their cornerstone characteristics with what we’ve covered in the previous-generation Mali-G77 and Mali-G78.

Amongst the larger changes we saw with Valhall was the shift from a wavefront/warp size of 8 towards 16, with dual datapaths (clusters) per execution engine, resulting in a 32 FMA/core design that we saw in the G77 and G78.

The ISA is said to have seen larger improvements, having been designed with modern APIs such as Vulkan in mind, although it’s always quite hard to quantify the impact such changes have on the overall performance and efficiency of a GPU.

What’s new in the Mali-G710 is the addition of a second execution engine, effectively doubling the compute performance per shader core of the Valhall architecture. In a sense, Arm here is re-adopting a scaling approach that we had seen in past-generation Mali architectures, such as the Mali-G76, which had three execution engines per shader core.

In the above slide, the “8x” and “4x” metrics refer to throughput per cycle per core, and we can see that other functional blocks of the GPU have also doubled in throughput to keep up with the doubled compute throughput of the execution engines.

The new G710 includes a brand-new texture unit that is now able to handle up to 8 bilinear texels per clock, and Arm has generally optimised the new design to be significantly more area efficient, giving the new TMU a +50% performance density advantage.

Within the execution engine, Arm continues to employ two processing units, or clusters of processing elements, and in that regard we don’t see much difference between the generations. However, if we look deeper into the actual processing unit, there are changes to the blocks:

In the simplest terms, what we’re seeing is a shift from a single instance of 16-wide (warp-wide) processing elements and execution units to four instances of 4-wide execution units. The throughput of a processing unit doesn’t change between the designs, but the new microarchitecture gives more dedicated resources to the processing elements and allows for better structuring and better efficiency.

Overall, the new execution engine design doubles the FMAs per clock per core, which is somewhat obvious, but it also has the benefit of lowering the energy distribution within the shader core from the execution engine by 20%.
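The per-core FMA arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ShaderCore` class and its field names are hypothetical, the G77/G78 figure of 32 FMA/core comes from the text, and the G710 figure is simply derived from the second execution engine and the four 4-wide units per cluster described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShaderCore:
    """Hypothetical model of one shader core's FMA datapath layout."""
    name: str
    execution_engines: int    # execution engines per shader core
    clusters_per_engine: int  # processing units (clusters) per engine
    units_per_cluster: int    # execution unit instances per cluster
    lanes_per_unit: int       # FMA lanes per execution unit

    def fma_per_clock(self) -> int:
        # Peak FMAs per clock is the product of all four levels.
        return (self.execution_engines
                * self.clusters_per_engine
                * self.units_per_cluster
                * self.lanes_per_unit)


# G77/G78: one engine, two clusters, one 16-wide unit per cluster.
G77 = ShaderCore("Mali-G77/G78", 1, 2, 1, 16)
# G710: two engines, two clusters, four 4-wide units per cluster.
G710 = ShaderCore("Mali-G710", 2, 2, 4, 4)

for core in (G77, G710):
    print(f"{core.name}: {core.fma_per_clock()} FMA/clock/core")
```

Note that the per-cluster width stays at 16 lanes in both layouts (1 × 16 versus 4 × 4), so the doubling comes entirely from the second execution engine.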

A further very large highlight of the G710 is the replacement of the traditional “Job Manager” with the new “Command Stream Frontend”, which handles the scheduling and processing of draw-calls. The CSF introduces a new CPU of undisclosed nature and, for the first time, also brings a firmware layer to Mali GPUs.

The goals of the design are to achieve more flexible and scalable performance for more complex graphical workloads, while at the same time improving system CPU power efficiency by reducing driver overhead through a very lightweight submission path. It simplifies support for API features such as state inheritance and secondary buffers, and helps with timing-sensitive applications such as VR or time-warp. Synchronisation events also greatly benefit from the move closer to the hardware and the reduction in latency that this enables.

The firmware is closely coupled to the hardware and handles requests from the host as well as command buffer completion notifications. It reduces the overhead of things such as protected mode entry and exit, and even allows for emulation of API features that don’t yet exist in the hardware through additional instructions.

The new hardware has been redesigned from the ground up to keep up with modern content and to sustain the throughput of job submission into the other GPU units. Arm here claims that the new CSF allows for up to 5 million draw-calls per second.
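To put the claimed 5 million draw-calls per second into perspective, the figure can be converted into a per-frame budget. The following is a rough sketch only: the function name and the chosen frame rates are assumptions for illustration, and the result is a theoretical ceiling derived from Arm's claim, not a measured number.

```python
# Arm's claimed peak CSF submission rate, taken from the text above.
CSF_DRAW_CALLS_PER_SECOND = 5_000_000


def draw_calls_per_frame(fps: int,
                         per_second: int = CSF_DRAW_CALLS_PER_SECOND) -> int:
    """Theoretical draw-call budget per frame at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    # Integer division: a partial draw-call is not useful.
    return per_second // fps


# Frame rates here are illustrative assumptions, not from Arm.
for fps in (30, 60, 120):
    print(f"{fps} fps: up to {draw_calls_per_frame(fps)} draw-calls per frame")
```

Real content would land well below this ceiling, since the rest of the GPU and the driver also have to keep up with each submitted draw-call.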

Overall, the new G710 microarchitecture seems very interesting, and in particular seems to want to address some of the API overhead related weaknesses of Arm’s Mali GPUs. How this plays out remains to be seen, but the advertised performance and power efficiency gains of 20% this generation look like a solid improvement, although these figures wouldn’t be quite sufficient to alter the competitive landscape in the mobile market.

The Mali-G610 is the same microarchitecture as the G710, only under a different name for configurations with fewer than 7 cores.

30 Comments

  • ForNein - Tuesday, May 25, 2021 - link

    Which Mali gpu is Google's Whitechapel going to have? I suspect it will be something older and unimpressive.
  • EthiaW - Tuesday, May 25, 2021 - link

    Google has floudered in its every hardware project, don't hold expectation in them friend.
  • Kangal - Thursday, May 27, 2021 - link

    They're somewhat incompetent in software too. Heaps of easy and obvious bugs in every platform in every update... at least when compared to their competitors. Also their biggest successes were projects that was kickstarted externally, and simply acquired.

    Do they have poor QA Department or something?
    Or maybe it's just they have some of the worst management team in the world, where they release half-finished projects to the world, only to double-back and kill them off in a small timeframe.
  • Spunjji - Friday, May 28, 2021 - link

    Their internal structure rewards doing novel things, not sustained success.
  • Fulljack - Wednesday, May 26, 2021 - link

    probably last year's G78, but it's also possible to use this new G710.
  • tuxRoller - Thursday, May 27, 2021 - link

    According to xda (https://www.xda-developers.com/google-pixel-6-same... it looks to be G78.
  • SarahKerrigan - Tuesday, May 25, 2021 - link

    "The only other surviving GPU IP vendor"

    This is not accurate. Think Silicon is one exception; Verisilicon, which inherited the Vivante product family, is arguably another, though it hasn't seen much movement since the GC8k came out.
  • Infy2 - Tuesday, May 25, 2021 - link

    No ray tracing support? Boo!
  • Wereweeb - Wednesday, May 26, 2021 - link

    If you want a frying pan functionality, just ask.
  • Spunjji - Friday, May 28, 2021 - link

    ARM have a primary focus on area efficiency, spending area on RT features that are too slow to use would not be a good decision for them.
