There has been a strong desire for a series of industry standard machine learning benchmarks, akin to the SPEC benchmarks for CPUs, in order to compare competing solutions. Over the past two years, MLCommons, an open engineering consortium, has been discussing and disclosing its MLPerf benchmarks for training and inference, with key consortium members releasing benchmark numbers as the series of tests gets refined. Today we see the full launch of MLPerf Inference v1.0, along with ~2000 results added to the database. Alongside this launch, a new MLPerf Power Measurement technique, which provides additional metadata on these test results, is also being disclosed.

The results today are all focused around inference – the ability of a trained network to process incoming unseen data. The tests are built around a number of machine learning areas and models attempting to represent the wider ML market, in the same way that SPEC2017 tries to capture common CPU workloads. For MLPerf Inference, this includes:

  • Image Classification on ResNet50-v1.5
  • Object Detection with SSD-ResNet34
  • Medical Image Segmentation with 3D UNET
  • Speech-to-text with RNNT
  • Language Processing with BERT
  • Recommendation Engines with DLRM

Results can be submitted into a number of categories, such as Datacenter, Edge, Mobile, or Tiny. For Datacenter or Edge, they can also be submitted into the ‘closed’ category (apples-to-apples with the same reference frameworks) or the ‘open’ category (anything goes, peak optimization). The metrics submitted depend on the scenario: single stream, multiple stream, server response, or offline data flow. For those tracking MLPerf’s progress, the benchmark set is the same as v0.7, except that all DRAM must now be ECC and steady state is measured with a minimum 10 minute run. Submitters must declare which datatypes are used (int8, fp16, bf16, fp32). The benchmarks are designed to run on CPU, GPU, FPGA, or dedicated AI silicon.


The companies that have been submitting results to MLPerf so far are a mix of vendors, OEM partners, and MLCommons members, such as Alibaba, Dell, Gigabyte, HPE, Inspur, Intel, Lenovo, NVIDIA, Qualcomm, Supermicro, and Xilinx. Most of these players have big multi-socket systems and multi-GPU designs, depending on which market they are targeting with their results. For example, Qualcomm has a system result in the datacenter category using two EPYCs and five of its Cloud AI 100 cards, but it has also submitted data to the edge category with an AI development kit featuring a Snapdragon 865 and a version of its Cloud AI hardware.

Qualcomm's Cloud AI 100

The biggest submitter for this launch, Krai, has developed an automated test suite for MLPerf Inference v1.0 and run the benchmark suite across a number of low-cost edge devices such as the Raspberry Pi, NVIDIA’s Jetson, and Rockchip hardware, all with and without GPU acceleration. As a result, Krai provides over half of all the results (1000+) in today’s tranche of data. Compare that to Centaur, which has provided a handful of data points for its upcoming CHA AI coprocessor.

Because not every system has to run every test, there’s not a combined benchmark number to provide. But taking one of the datapoints, we can see the scale of the results submitted so far.

On ResNet50, with 99% accuracy, running an offline dataset:

  • Alibaba’s Cloud Sinian Platform (two Xeon 8269CY + 8x A100) scored 1,077,800 samples per second in INT8
  • Krai’s Raspberry Pi 4 (1x Cortex A72) scored 1.99 samples per second in INT8
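
To put the gap between those two data points in perspective, the ratio can be worked out directly. The short Python sketch below uses only the two figures quoted above; the variable names are illustrative and are not part of any MLPerf tooling.

```python
# Offline ResNet50 throughput figures quoted above (samples per second, INT8).
alibaba_sinian_sps = 1_077_800  # two Xeon 8269CY + 8x A100
krai_rpi4_sps = 1.99            # Raspberry Pi 4

# How many times faster the datacenter system is than the edge board.
ratio = alibaba_sinian_sps / krai_rpi4_sps
print(f"Throughput ratio: roughly {ratio:,.0f}x")
```

That is a spread of more than 500,000x between the fastest and slowest entries on this one test, which is why a single combined score would say very little.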

Obviously certain hardware would do better with language processing or object detection, and all the data points can be seen at MLCommons’ results pages.

MLPerf Inference Power

A new angle for v1.0 is power measurement metadata. In partnership with SPEC, MLPerf has adopted the industry standard SPEC PTDaemon power measurement interface as an optional data add-on for any submission. These are system-level metrics, rather than simply chip level, which means that extra controllers, storage, memory, power delivery, and the efficiencies therein all count towards the data measurement submitted.

MLPerf provides the example of a Gigabyte server with 5x Qualcomm Cloud AI 100 cards averaging 598 W during an offline test for 1777.9 queries per second. Submitters are allowed to provide additional power data in submission details, such as processor power; however, only the system-level power will be part of the official submission process.
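
Because the power figure is a system-level average, it can be combined with the throughput to give a rough efficiency number. The sketch below is a back-of-the-envelope calculation from the two values quoted above, not an official MLPerf metric; the function name `efficiency` is hypothetical.

```python
def efficiency(queries_per_second, avg_system_watts):
    """Return (queries per joule, joules per query).

    One watt is one joule per second, so dividing throughput by average
    power gives the work done per joule of system-level energy.
    """
    if queries_per_second <= 0 or avg_system_watts <= 0:
        raise ValueError("throughput and power must both be positive")
    queries_per_joule = queries_per_second / avg_system_watts
    joules_per_query = avg_system_watts / queries_per_second
    return queries_per_joule, joules_per_query


# Figures from the MLPerf example: 1777.9 queries/s at an average of 598 W.
qpj, jpq = efficiency(1777.9, 598.0)
print(f"{qpj:.2f} queries per joule, {jpq:.3f} joules per query")
```

With these inputs the system lands at roughly 3 queries per joule. Since the measurement covers the whole server, that figure includes the host CPUs, memory, storage, and power delivery losses, not just the accelerator cards.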

Around 800 of the submitted data points in today’s list come with power data. Again, most of them are from Krai.

Full results can be found at the MLCommons website.
