Bulldozer's Power Management

AMD confirmed that the Bulldozer core's power management is an improved version of the power management introduced with the “Llano” CPU. Just like Llano, Bulldozer has a Digital APM Module. The APM module samples a number of performance counter signals, and these samples are used to estimate dynamic power with 98% accuracy. Now combine this power estimate with Bulldozer's power gating at the module level and vastly improved clock gating, and you can start to understand what is possible.
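To make the idea of estimating power from counters concrete, here is a minimal sketch. It assumes a simple linear model in which each sampled event type contributes a fixed weight to dynamic power. AMD has not published the actual signals or coefficients the APM module uses, so the event names and weights below are purely hypothetical.

```python
# Hypothetical weights: watts contributed per event per cycle.
# These are illustrative numbers, NOT AMD's real APM coefficients.
WEIGHTS = {
    "int_ops": 4.0,
    "fp_ops": 6.0,
    "l2_accesses": 10.0,
}


def estimate_dynamic_power(counts, cycles, weights=WEIGHTS):
    """Estimate dynamic power (watts) as a weighted sum of per-cycle event rates.

    counts: dict mapping event name -> number of events in the sample window
    cycles: number of clock cycles in the sample window
    """
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return sum(w * counts.get(name, 0) / cycles for name, w in weights.items())


if __name__ == "__main__":
    # An integer-only workload: the FPU sees no activity in this window.
    sample = {"int_ops": 1000, "fp_ops": 0, "l2_accesses": 100}
    print(estimate_dynamic_power(sample, cycles=1000))
```

The point of the sketch is only that an activity-based estimate reacts to what the code is actually doing: an integer-only window contributes nothing for the idle FPU, so the estimate drops accordingly.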


Bulldozer reduces the number of active and power consuming circuits by vastly improved clock gating

If your application runs only one or a few threads on your 8-module, 16-core Interlagos CPU, several of those modules might be power gated. Or if you run integer-only threads, the fact that quite a few unused parts of the module (such as the FPU) will be clock gated might be enough to stay under the configured TDP. So in those cases, it won't be necessary to limit the clock speed. And that is really great, especially in the real world.

In the real world, only a few HPC applications behave like the SPEC CPU rate benchmarks, which spawn threads across all cores. Most server applications do not fully utilize all available cores all the time. Sometimes, only one thread will be really critical and the perceived application performance will depend on it. A little bit later several threads might demand CPU power (but not all cores will be busy). Only a certain percentage of the time are all the cores used. That is exactly the reason why the cheaper Magny-Cours makes so much sense for HPC applications, yet struggles to keep up with the higher clocked, higher IPC Xeon Westmere cores when running OLTP and ERP applications. Putting a power cap on a Magny-Cours means even lower frequencies, and as a result even higher response times (as we have measured here).

By adding power consumption measurements to the CPU, Bulldozer will run most server applications at full speed unless you lower the TDP too far. (Obviously, if the TDP is lowered enough, the CPU will not be able to operate at higher frequencies, thus degrading the response time performance too.) The maximum throughput will be a little bit lower, but most server applications almost never run at maximum throughput. In fact, maximum throughput only matters for HPC applications and benchmarks. For real human users, response times are the only thing that matters.
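The behavior described above can be sketched as a small model. It is only an illustration under stated assumptions: the P-state table, the capacitance-like constants `k_int` and `k_fpu`, and the uncore power are hypothetical values, not Interlagos specifications, and dynamic power is approximated with the textbook C·V²·f relation. Power-gated modules and clock-gated units are modeled as drawing no dynamic power.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PState:
    freq_ghz: float
    volts: float


# Hypothetical P-state table, fastest first (not real Interlagos values).
PSTATES = [PState(2.6, 1.20), PState(2.2, 1.10), PState(1.8, 1.00), PState(1.4, 0.90)]


def module_power(p, int_active, fpu_active, k_int=4.0, k_fpu=2.0):
    """Dynamic power of one module, approximated as C * V^2 * f.

    Gated units contribute nothing; k_int and k_fpu are hypothetical constants.
    """
    c = (k_int if int_active else 0.0) + (k_fpu if fpu_active else 0.0)
    return c * p.volts ** 2 * p.freq_ghz


def pick_pstate(modules, cap_watts, uncore_watts=20.0):
    """Return the fastest P-state whose estimated power fits under the cap.

    modules: list of (int_active, fpu_active) tuples, one per module;
             a power-gated module is (False, False).
    Falls back to the slowest P-state if nothing fits.
    """
    for p in PSTATES:
        total = uncore_watts + sum(module_power(p, i, f) for i, f in modules)
        if total <= cap_watts:
            return p
    return PSTATES[-1]


if __name__ == "__main__":
    cap = 100.0
    # Two integer-only threads: six modules power gated, FPUs clock gated.
    light = [(True, False)] * 2 + [(False, False)] * 6
    # All eight modules busy with integer and FP work.
    heavy = [(True, True)] * 8
    print(pick_pstate(light, cap).freq_ghz)  # top P-state still fits
    print(pick_pstate(heavy, cap).freq_ghz)  # clock must drop to fit the cap
```

With these made-up numbers, the lightly threaded case stays at the top frequency under a 100 W cap, while the fully loaded case has to fall back to the slowest P-state. That is the trade-off the text describes: response times for typical loads are preserved, and only peak throughput is reduced.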

The beauty of this new power cap system is that in normal circumstances (e.g. the server is running at 40-70% load), the response times will hardly be any longer. At the same time, the administrator can make sure that the server cluster does not exceed the capacity of the cooling equipment and the power lines.

This TDP Power Cap technology could be very interesting to small and medium businesses too, and not only to owners of large server clusters. TDP Power Cap could be a way to make sure that your colocated servers never exceed the maximum number of amps allocated to you, and as a result you will not have to pay unexpectedly high electricity bills. However, whether or not this ideal world of low response times and low electricity bills will become a reality for Bulldozer server owners will also depend on the availability of a good and decently priced management software tool that allows the administrator to configure the TDP on all servers simultaneously.
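As a rough illustration of how an amp allocation could translate into a per-socket TDP setting, consider the arithmetic below. Every number is a hypothetical example (circuit rating, server count, non-CPU platform power), and the 80% headroom factor is a commonly used derating assumption rather than anything AMD specifies.

```python
def per_socket_cap_watts(amps, volts, servers, sockets_per_server,
                         platform_watts, headroom=0.8):
    """Per-socket CPU power cap (watts) that keeps a rack under its amp allocation.

    amps, volts:     the allocated circuit
    servers:         number of servers sharing the circuit
    platform_watts:  assumed non-CPU power per server (memory, disks, fans, PSU loss)
    headroom:        fraction of the circuit rating we allow ourselves to use
    """
    usable_watts = amps * volts * headroom
    cpu_watts_per_server = usable_watts / servers - platform_watts
    if cpu_watts_per_server <= 0:
        raise ValueError("platform power alone exceeds the allocation")
    return cpu_watts_per_server / sockets_per_server


if __name__ == "__main__":
    # Hypothetical colocation setup: one 16 A / 230 V circuit, eight 2-socket servers.
    cap = per_socket_cap_watts(amps=16, volts=230, servers=8,
                               sockets_per_server=2, platform_watts=150)
    print(f"{cap:.0f} W per socket")
```

In this made-up case the budget works out to roughly 109 W per socket, which is the kind of value an administrator would then enter as the TDP limit.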

On a standard server, you will get a section in the BIOS that allows you to tweak the TDP in 1W increments (or a maximum of 64 power settings), a good step forward compared to the current p-state setting. But to control a server cluster in an efficient way, good management software is needed. Currently, you have to buy all your servers from the same vendor (HP for example) and then pay for management software such as HP's Insight Control. To really unlock this technology, AMD or one of its partners needs to make sure this kind of software is widely available; some open source code, perhaps?


59 Comments


  • duploxxx - Friday, July 15, 2011 - link

    according to many, anything which is branded "PENTIUM" is the uber CPU doesn't matter what is behind....
  • Broheim - Friday, July 15, 2011 - link

    >according to many

    source?

    don't have one? then gtfo.
  • formulav8 - Friday, July 15, 2011 - link

    Grow up. He was just messing around
  • Broheim - Friday, July 15, 2011 - link

    no, he's a raging AMD fanboy. I have yet to see a single post from him that doesn't bash intel or praise AMD in some form or another.
  • AnandThenMan - Friday, July 15, 2011 - link

    So he's the exact opposite of you.
  • Broheim - Saturday, July 16, 2011 - link

    erm, I have nothing against AMD, this rig has an unlocked HD6950...

    are you just butthurt because I called you out on your bitching about Anand's benchmarking?
  • just4U - Saturday, July 16, 2011 - link

    Currently I am on a Sandy Bridge 2500k and in the last year I've been on a i7 920 a 1055T, and a few $60 amd cheapies. As far as I am concerned they are all good. I didn't notice night and day improvements like I did when I moved to the A64 and Core2. So I think we are sort of at a ceiling limit right now (excepting specific tasks) where just about any new cpu is good enough.
  • JohanAnandtech - Friday, July 15, 2011 - link

    it is possible that your tests are using the x87 FPU. The Phenom can process up to 3 instructions per cycle out of order, while the P4 can hardly sustain one FP per cycle.

    Parallel, multithreaded software is of course much faster on a 6-core than a single P4 core :-).

    And it would be very hard to find a benchmark where P4 at 4 GHz is faster than a Phenom II 2.8 GHz. I can not imagine that anyone has published one. The P4 has a much slower memory interface (very high latency vs Phenom IMC), much smaller caches (16 KB vs 64 KB L1) and is outmatched in every aspect of FP processing power (64 vs 128 SIMD, triple fast x87 FPU vs single slow one) ...
  • SanX - Friday, July 15, 2011 - link

    Amazing was that performance increase by factor of two was per CPU of course. The whole 6-core not overclocked AMD CPU was 2.42/0.50 or almost 5 times faster than 2-core overclocked to 3.8GHz Intel E8400!

    Here are the numbers for the parallel algebra (you can download the test code from equation dot com or i have it too for different compilers) for Intel and AMD in seconds when i switch ON different amount of cores

    1 4.64 seconds
    2 2.42

    1 2.46
    2 1.22
    3 0.83
    4 0.67
    5 0.58
    6 0.50

    I invite anyone to do the test on their CPUs.
  • JarredWalton - Friday, July 15, 2011 - link

    Using 64-bit "bench1_gfortran_64.exe":

    Core 2 QX6700 @ 3.2GHz:
    1 CPU = 4.55s
    2 CPU = 2.33s
    3 CPU = 1.62s
    4 CPU = 1.34s

    Core i7-965 @ 3.6GHz:
    1 CPU = 3.93s
    2 CPU = 1.97s
    3 CPU = 1.33s
    4 CPU = 1.01s
    5 CPU = 0.87s
    6 CPU = 0.80s
    7 CPU = 0.72s
    8 CPU = 0.69s

    Of course, none of that really tells us much, because we don't know how the application was compiled or what optimizations are in place. There's only one 64-bit compiled version but there are four 32-bit compiled versions. Let's just see what happens with the 32-bit versions on the QX6700 for a second:

    Core 2 QX6700 @ 3.2GHz Absoft:
    1 CPU = 7.01s
    2 CPU = 3.54s
    3 CPU = 2.40s
    4 CPU = 1.90s

    Core 2 QX6700 @ 3.2GHz gfortran:
    1 CPU = 10.73s
    2 CPU = 5.40s
    3 CPU = 3.67s
    4 CPU = 2.87s

    Core 2 QX6700 @ 3.2GHz Intel Fortran:
    1 CPU = 4.70s
    2 CPU = 2.40s
    3 CPU = 1.76s
    4 CPU = 1.47s

    Core 2 QX6700 @ 3.2GHz Lahey/Fujitsu:
    1 CPU = 5.38s
    2 CPU = 2.73s
    3 CPU = 1.95s
    4 CPU = 1.56s

    What does that tell us? As expected, the Intel compiler version is the fastest in 32-bit mode. What's more, the gfortran 32-bit version is the slowest on Intel. Since the only 64-bit version is from gfortran, it would appear that a 64-bit Intel version would come in around twice as fast. That's only speculation based on the 32-bit compiled executables, but given your above numbers it looks like you're probably using the 64-bit version. (If not, why does my 3.2GHz quad-core outperform your 3.8GHz dual-core when looking at the 32-bit Intel speeds?)

    Anyway, there are certain types of code that AMD does quite well at running, but overall I'd say it's clear that Intel's Nehalem/Lynnfield/Sandy Bridge CPUs are significantly faster than the Phenom II X6 offerings.
