Testing the New 3DMark CPU Benchmark: For the Boidsby Dr. Ian Cutress on July 15, 2021 9:00 AM EST
- Posted in
A couple of weeks ago, UL (formerly Futuremark) released the latest test in its ongoing 3DMark gaming benchmark suite, CPU Profile. The premise behind this new CPU-specific test is a simulation to measure how processor performance scales with cores and threads. Normally 3DMark tests are designed to measure overall gaming performance – and thus are largely a GPU benchmark – however this one is a little different since it focuses more specifically on CPU performance. So we wanted to take a look at UL's latest test to get a better idea of what exactly it is testing, what exactly it is trying to accomplish, and just how useful it might be.
UL’s 3DMark and the New Test
The 3DMark software (with the tagline 'The Gamer's Benchmark') has been a staple of the synthetic benchmarking community for its variety of tests designed to emulate different levels of gaming complexity. From a singular interface, users can run simple tests aimed at mobile and integrated graphics performance, to mid-level gaming at reasonable resolutions and detail, up to overengineered tests for systems that don’t exist yet. Each of the tests provides a baseline set of graphics calculations designed to emulate video game performance and produces a composite number to represent that performance for that market. If you’ve ever heard of Time Spy or Fire Strike, two popular benchmarking tests particularly for overclockers, then 3DMark is where it comes from.
3DMark also acts as a vehicle for new feature tests. Over the years UL has introduced separate specific tests to find draw call limitations, DirectX Raytracing processing performance, Variable Rate Shading (VRS) performance, PCIe 4.0 testing, and NVIDIA DLSS performance. The newest test to this portfolio is the CPU Profile, the point of this article.
What is the CPU Test Measuring
The CPU Profile test showcases a simple low resolution scene derived from the imagery of the latest gaming tests. The rate limiter of this scene is the raw CPU calculations in the background – the test runs an effective 150 frames of images, however each frame involves a parallel compute framework based on the flocking of birds.
Bird flocking, also known in simulation as boids (bird-oid object, not an accent thing), involves the interaction of a large number of objects in movement to each other depending on small random movement and rules regarding separation, alignment, and cohesion. Each boid has to:
- be wary of its distance to other boids in a pack (separation),
- the direction of travel relative to others (alignment), and
- the desire to move towards an average position within line of sight (cohesion)
We’ve all seen how birds move in mass flocks, or fish in shoals, and there are actual mathematical models that can be used to simulate it. A minor adjustment in separation, alignment, and cohesion can adjust exactly how they all interact and move.
From a simulation standpoint, each boid is independent in its movements such that it can be calculated in parallel to others, however each boid needs to have knowledge of its local environment and the positions and directions of other boids inside that environment. The more boids in the local environment, the bigger the lookup table for that individual has to be – the size of that lookup table on each time step is often a mix between separation distance and line of sight: the more objects an individual can see/is interacting with at once, the bigger that calculation. The data for this lookup table has to be polled from many different places in cache and memory, almost at random, and for perfect simulation, on every timestep as well.
Boids with simple edge boundary conditions
Beyond that, boid simulation isn’t usually run on CPU cores anyhow. Users can interact with a GPU version in their browser today, with 65000+ boids running very happily.
So with all this talk about boids, the CPU Profile test in 3DMark is doing exactly this simulation exclusively*. The workload outlined on 3DMark’s states that they have a simple, highly optimized simulation of boids split into two parts.
- One: Half the boids use SSSE3 optimized instructions
- Two: Half the boids use AVX2 optimized instructions, otherwise SSSE3
The benchmark does six separate sub-tests based on the number of threads: 1, 2, 4, 8, 16, max. Rather than giving an overall score, the test hands the user six different scores, based on a simple calculation:
- Score = 350,000 / average frame time
The simulation lasts for a fixed 150 frames, so each sub-test has the same fixed calculation simulation (and we assume the same fixed seeds for RNG). On the fastest processors, the max threads section can take under 10 seconds, allowing the simulation to run with CPUs operating entirely within turbo clockspeeds (we’ll get back to why this matters later), while the single thread section on the slowest processors can take five minutes or so.
The end results page is something that looks like this:
The test gives you six different results along with a system info tracker if it was enabled.
The ultimate purpose of the test being to benchmark CPU performance at several different thread counts, making a test that can scale up to use all of the threads a consumer CPU can provide, but also offers a look at performance with lower thread counts, which is where many games lie today. Put another way, on deciding whether to have a single-threaded or multi-threaded gaming test, UL decided to do both by testing with multiple thread counts.
*On launch UL's website said the test was in two parts with a physics engine, however UL has clarified to us in email that this was a copy/paste error from a previous test. The website has since been updated.
Discussion of the Test
Normally when probing a new test for our benchmark suite, it pays off to take a critical eye to what exactly the test is measuring and how it relates to the real world. Every benchmark has a place in a review lineup, though it is always important to quantify where it should be, and what weight should be given to the results. For example, we have real-world tests that assist in performance on that software, but we also have a mix of synthetic tests for overall performance perception. Usually we focus more on the real world tests for analysis and recommendation, but the small portion of synthetics help in maintaining baselines and for those that want to see them.
Normally we filter 3DMark’s gaming tests into that latter portion of synthetic testing. With the same program version and the same video drivers, we can see how different processors and graphics cards scale in light of the synthetic workload, even if the synthetic workload is trying to emulate an average gaming experience. UL has been quite clear that the goal of 3DMark’s gaming tests is to do just that – emulate real world performance.
Unfortunately, the commentary around the CPU Profile test is rather unclear. You might be forgiven for thinking that the test is designed to showcase where a processor might be limited in gaming; after all the test is shipped alongside a half-dozen other GPU gaming tests and during the test itself, we’re treated to some very game-looking imagery.
The arrows on the left look to be boids (300-ish?), but unsure if related to the simulation at all
In practice, it's unclear whether the images shown on screen have anything to do with the simulation at hand (while UL has responded to a couple of emails, they haven’t answered this directly yet). We only see 300 or so boids on screen, and yet a simple simulation on a single core of a Core i7-6950X can easily do a few thousand.
If we go into UL’s press release for the test, the headline for the page is ‘New CPU benchmarks for gamers and overclockers’, the page describes that it runs a CPU simulation across 1,2,4,8,16, max threads. For each of those sub-tests, it also gives a brief indication of what the test is useful for. Here is our summary of UL’s press release on the sub-tests:
- 1 Thread: Raw CPU performance, but others scores are better indicators of gaming.
- 2 Threads: Best for DX9 games such as DOTA2, League, and CS:GO
- 4 Threads: Best for DX9 games such as DOTA2, League, and CS:GO
- 8 Threads: Modern DX12 games, correlates will with 3DMark TimeSpy
- 16 Threads: Computational tasks, less relevant for gaming
- Max Threads: Full Performance, not relevant for gaming
In gaming workloads, we would generally agree with this. However, the underlying workload used in CPU Profile is not a gaming workload. This is where the confusion kicks in. UL says that its boid simulation is akin to similar situations in games, even to the point where having half using SSSE3 and half using AVX2 is more akin to game engines using different optimizations; however it completely skips over the fact that in every one of its sub-tests, the ‘game’ is CPU limited, even at 8 threads, and at 16 threads. This is fine for a CPU-speciifc test, but it is naive of how most games function on high-end hardware.
As mentioned above, UL hasn’t stated how dense its boid simulation is, nor how it scales; by AnandTech’s estimates you need at least 2000+ to saturate a single thread with unoptimized code, so with optimized code scaled across 8 threads or 16 threads, we need to be looking at 50000 or 100000 flocking objects in a simulation space. For games that showcase boid flocking environments, most of them are using secondary physics, i.e. unable to be influenced by the character, but those that do have interacting physics, they are unlikely to be simulating on this scale. Moreover, there’s nothing to say that a game engine won’t simply increase/decrease the boids in the simulation based on performance.
Orthogonal to all of this is the length of the test. Because the test is a fixed 150 frames regardless of how many threads are working, it means the best processors can churn through the max threads in a few seconds, while the slowest processors take several minutes in 1T mode. The discussion point here is down to how each processor induces its Turbo modes.
At various times in the past decade, Intel and AMD has privately expressed concern for big max thread workloads that take only a few seconds to complete - usually max thread workloads require enough time for a processor to hit a steady state frequency, and so completing within the turbo window makes the test an unrepresentative metric. Take, for example, CineBench R20 that can complete in 5 seconds with a higher average pixels per second than a Cinema4D test that might take a few hours. Moreover, gaming is a while ride of turbo results, and not a fixed workload constantly at turbo. If Intel and AMD have previously stated that these sorts of in-turbo max thread tests are irrelevant for performance comparisons, then the new CPU Profile test would fall to a similar fate.
We approached UL with this, along with the idea that the CPU Profile simulation should be a fixed time instead of a fixed set of frames, but in the end UL disagreed. One of the goals of the test was apparently having a short test length. They wanted the 8 thread result to correlate to Time Spy Extreme results, which meant finding a time that worked while also being short was a goal of the project. UL also stated that a 150 fixed frame test results in a fixed amount of work, and suggested that slower systems will process less with fixed time steps - I should point out this is irrelevant if you're taking an average when fixed time steps are in place. Over 150 frames, UL stated they could guarantee a balanced workload across all threads (something which doesn't happen in gaming), and beyond that the consistency of the test would diverge in its results.
Ultimately, I disagree with some of UL's choices here, and find that a lot of these arguments seem arbitrary at best – especially given my own experience in building our in-house tests such as 3DPM (which incidentally does do fixed time, not fixed compute). This also means that I’m having a hard time correlating what this benchmark is doing to how a user can interpret the results for a gaming workload. What UL has done here is create a CPU benchmark, first and foremost, and it appears that simply using a simulation mechanic ‘that can be used in games’ is being described as a tool to help identify gaming performance. At this stage, with the knowledge I have at hand, I remain unconvinced that the workload is gaming-relevant.
Typical for a UL benchmark, CPU Profile generates a series of dimensionless scores. These scores directly correlate to the underlying benchmark, but they aren't a specific measurement in and of themselves. Complicating matters a bit for CPU profile, the benchmark generates half a dozen scores – so unless you read the documentation, the data can come off as glut of numbers that are lacking context.
Taking a look at these numbers, UL states on its website that the results help showcase the result compared to others, but also the overclocking potential for your processor. This is a hint that this benchmark is actually better for overclockers than anyone else, as having six different results numbers and six different recommendations for CPU overclocking doesn’t help how to interpret gaming much, especially given the bar showcasing the score is quite small and not offering any additional context.
Results from one of our CPUs, hard to see those bars
There’s also the matter of presenting the result as a score. All of UL’s tests give a score at the end, and as we’ve showcased above the results for this test a calculation of an arbitrary number (350000) divided by the average frame time (in milliseconds). The reason for not giving the results as a raw frame time is simple psychology – bigger numbers look better on graphs and are easier to interpret. So by dividing a number by the average frame time, everything gets a scale. It also helps that removing the units of the result might reduce confusion. The downside here is that the initial number is very arbitrary.
On the website, UL calls it a reference value using ‘a time constant set to 70 multiplied by a score constant set to 5000’, which comes to 350000. There are no explanations as to why these numbers exist, though we can interpret that 70 meant to be 70 milliseconds, and if a score achieves 70 milliseconds (note you need an 8 core processor to get that) then the final result is 5000 points. Almost all processors in all sub-tests will score under this, showcasing that the pivot for the results scaling is actually higher than most processors will achieve.
With the data, UL could have simply represented the data as an average frame rate. For example, here are some results for the Ryzen 7 2700X, an 8 core/16 thread processor, running at stock with JEDEC memory. The table showcases the raw average frame time, UL’s score, and an average frame rate metric.
|3DMark CPU Profile
AMD Ryzen 7 2700X
Frame Time (milliseconds)
|3DMark CPU Score||Average
Frames Per Second
|1T||660.9 ms||530||1.5 fps|
|2T||380.8 ms||919||2.6 fps|
|4T||217.3 ms||1611||4.6 fps|
|8T||121.5 ms||2881||8.2 fps|
|16T||78.6 ms||4453||12.7 fps|
|nT||78.0 ms||4487||12.8 fps|
Note that if your game is running at 12 frames per second on a Ryzen 7 2700X, then something is set too high anyway.
But as we start listing multiple processors, this data gets excessive and dense very fast.
|3DMark CPU Profile
Results Given as Average FPS
How should we order this table? Should it be ordered by 1T results, or by max thread results? If we’re focusing on gaming, perhaps we should order by 2T/4T or 8T instead, which makes the other results extra data that we’re discarding for being irrelevant or making it too complex. As is usually the case, the downside to offering multi-dimensional data – in this case, results with multiple quantities of threads – is that it becomes a whole lot harder to present it in a simple manner.
To date I’ve run the test on 24 processors, from a 64-core Threadripper down to a dual core Apollo Lake. Rather than a table of results, these results are ordered in which processor scores the highest for each of the sub-tests. There’s even a Sandy and Ivy Bridge in there.
The resulting graph is quite noisy, especially as the fastest high thread count processors are not the fastest low thread count processors (and vice versa). Ultimately a graph like this might look better with just a few elements on it, such as here:
This showcases that the Core i9-11900K scores best on this test, until it hits 16 threads when the extra memory bandwidth of the 3990X takes over. It should be noted that Tiger Lake does abysmal on this test, just behind the R9 3950X in 1T and behind the i3-9100F in max threads, as the power limits of the mobile processor matter more than the extra threads. I will need to check with a U-series AMD to see what the difference is here.
By and large when we scale out to more threads, we see that having a more complete system helps on this test, however in the single threaded mode, it doesn’t all seem to be about IPC, which is perhaps one of the limits of the boid simulation. We can actually see the Core i3 perform better in 2T/4T compared to the Ryzen 9 3950X, perhaps due to cross-thread talk over the chiplets being more of a concern.
The benchmark in full: unsure if any of this relates to what's actually being calculated...
How this all relates to gaming though is a question left unanswered. It’s a strong CPU test, and as a simulation of flocking behavior, has the right elements for a scientific workload worth examining. However, interpreting the performance scaling as a function of gaming performance with a CPU-limited workload isn’t really relevant here, I feel – at least not without more information from UL about how they are interpreting this test. We have been emailing with UL back and forth to understand the test, and we are waiting to see if any further information will be made available. The thing is that most games that are CPU limited, especially DX9 esports titles, are bottlenecking on draw calls from the processor to the GPU, and aren't waiting on CPU compute except in a few fractional scenarios. This makes the CPU Profile test more for the extreme overclockers in that regard, trying to eke out performance across CPU and memory.
Post Your CommentPlease log in or sign up to comment.
View All Comments
philehidiot - Thursday, July 15, 2021 - linkIt'll be used to sell CPUs for gaming, I'm sure. It may not be relevant to gaming, but it's very pretty.
If I ever create a benchmark, it'll be based on the Hypnotoad.
Allll glllooory to the Hypnotoad.
RSAUser - Thursday, July 15, 2021 - linkThis test seems to correlate with L1 cache (and then also cross chiplet latency), since need to do comparison with other threads? This doesn't seem like something that will correlate well with games as work like this would have a central thread orchestrating all the others since all dependent on the user input, and then also as mentioned the bottleneck is usually draw calls etc., not this.
I would guess that AMD's next gen (4) will then render this test a bit worthless as they're intending to stack memory, don't know whether that will have an increase in L1 though.
philehidiot - Thursday, July 15, 2021 - linkThat cross chiplet latency is making me want a new CPU that I don't need. It's interesting that the raw scores will probably be used in reviews yet in the majority of games, something cheaper *may* be as good or better. It's about buying the right CPU for the job, not what scores highest on raw benchmarks.
mode_13h - Thursday, July 22, 2021 - linkThe AMD CPUs with a single compute die seem to hold up well, until the # of threads exceeds the physical core count. Specifically, the R7 2700X, the R9 4900HS, and the R7 3700X. That supports the notion that core-to-core latency is important.
To be honest, that seems odd to me. If I were implementing something like this, I'd compute the new position of each boid based on the results of the previous iteration. So, those previous results would be read-only and therefore wouldn't require any locking or synchronization to access. Then, you'd just batch the boids and enqueue the batches for a bunch of worker threads to consume. If the batch size is sensible, there shouldn't be much lock contention on the work queue, and that lock contention is likely where core-to-core latency becomes important.
Apart from lock-contention, the only other reason I can see why core-to-core latency would matter is if you have false-sharing (i.e. the datastructures for the different boids aren't aligned on 64-byte boundaries).
Arbie - Thursday, July 15, 2021 - linkTypo: after "You might be forgiven" you forgot "for thinking that".
I was hoping it was a blanket dispensation, but don't think you really meant that.
Ryan Smith - Thursday, July 15, 2021 - linkThanks!
(And I forgive you)
AusMatt - Thursday, July 15, 2021 - linkYour results graph (https://images.anandtech.com/doci/16817/Graph1.jpg... is mis characterising the AMD R5 5600X as "Zen2" instead of "Zen3".
HarryVoyager - Friday, July 16, 2021 - linkInteresting. The only games I know of that are typically CPU bound are flight sims, probably through a combination of high numbers of draw calls, and all the physics simulation going on in the background. And, as near as I can tell, they are all using AVX2 vector math extensions. It was enough that Zen 1/1+ chips just weren't usable, because they didn't support AVX.
Unfortunately, there aren't very many good benchmarks for flight sims. The closest thing I've found is the Il-2 SYN_Vander Benchmark v6 from the forums. While the results are repeatable, it is not a quick and easy benchmark to run, and it does seem to be very sensitive to memory latency. (Going from a 3800X to a 5800X was something like a 40% boost on my CPU performance).
It will be interesting to see if the single thread numbers correlate at all with the Vander results.
Slash3 - Sunday, July 18, 2021 - linkZen / Zen+ both feature full AVX and AVX2 feature compatibility. The older Phenom CPUs, however, do not, which has contributed (further) to making them age rather poorly.
Sims love IPC, as most rely on a single core physics thread which the rest of the engine piggy backs onto. The changes to instruction pipelining and cache improvements on Zen 3 are fantastic for that, as you say. Some other games that really stretched their legs are Starcraft 2 and Guild Wars 2. Both of these saw a near 50% improvement gen on gen. Good stuff.
mode_13h - Thursday, July 22, 2021 - linkPhenoms performed on par with Core2 and aged just about as well. Instruction-wise, I think Phenom only went up to SSE3, which is not quite as far as Core2 went.