TDP Power Cap

What makes these new Opterons truly intriguing is that they will offer a user-configurable TDP, which AMD calls TDP Power Cap. This means you can buy pretty much any CPU and then scale the TDP down to fit within your server’s power requirements. In the server market, raw performance isn’t necessarily the number one concern the way it is when building a gaming rig. As readers of our data center section are aware, what really counts is the performance per watt ratio: servers need to be as energy efficient as possible while still providing excellent performance.

John Fruehe (AMD) states, "With the new TDP Power Cap for AMD Opteron processors based on the upcoming 'Bulldozer' core, customers will be able to set TDP power limits in 1 watt increments." It gets even better: "Best of all, if your workload does not exceed the new modulated power limit, you can still get top speed because you aren’t locking out the top P-state just to reach a power level."

That sounds too good to be true: we can still get the best performance from our server while we limit the TDP of the CPU. Let's delve a little deeper.

Power Capping

Power capping is nothing new. The idea is not to save energy (kWh), but to limit the amount of power (watts) that a server or a cluster of servers can draw at any moment. Capping power without saving energy may sound contradictory, but it is not. If your CPU processes a task at maximum speed, it can return to idle very quickly and save power there. If you cap your CPU, the task takes longer and the CPU spends less time in idle, where it could save power in a lower P-state or even go to sleep (C-states), so your server ends up using about the same amount of energy. That is why power capping does not make any sense in a gaming rig: it would reduce your fps and not save you any energy at all. Buying CPUs with a lower maximum TDP is similar: our own measurements have shown that low power CPUs do not necessarily save energy compared to their siblings with higher TDP specs.
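To make the energy versus power distinction concrete, here is a minimal back-of-the-envelope sketch. All wattages and durations are made-up illustrative values, not measurements, and they are deliberately chosen so that the two cases come out equal; real servers will only land in the same ballpark.

```python
def energy_wh(active_w, idle_w, active_s, window_s):
    """Energy in watt-hours over a fixed time window:
    the server is busy for active_s seconds, then idles for the rest."""
    idle_s = window_s - active_s
    joules = active_w * active_s + idle_w * idle_s
    return joules / 3600.0

# Hypothetical numbers (assumptions, not measured data)
WINDOW_S = 600   # observe both servers for 10 minutes
IDLE_W = 100     # power draw while idle

# Uncapped: 300 W peak, task finishes in 200 s, then idles
uncapped = energy_wh(active_w=300, idle_w=IDLE_W, active_s=200, window_s=WINDOW_S)

# Capped at 200 W: the same task takes 400 s, leaving less idle time
capped = energy_wh(active_w=200, idle_w=IDLE_W, active_s=400, window_s=WINDOW_S)

print(f"uncapped: {uncapped:.2f} Wh at 300 W peak")
print(f"capped:   {capped:.2f} Wh at 200 W peak")
```

With these numbers both runs consume the same energy; only the peak draw differs, which is exactly the quantity a power cap is meant to control.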

In a data center, you have lots of servers connected to the same power lines, which can only deliver a certain amount of current (amps) at a certain voltage (48, 115, 230 V...). You are also limited by the heat density of your servers. So the administrator wants to make sure that the cluster of servers never exceeds the cooling capacity or the amp limits of the power lines. Power capping makes the power usage and the cooling requirements of your servers predictable.
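As a rough illustration of why a predictable per-server maximum matters, the sketch below works out how many servers fit on one circuit. The circuit rating, the 80% safety margin and the per-server wattages are all assumptions picked for the example; check your own facility's ratings and electrical code before planning anything with it.

```python
def servers_per_circuit(volts, amps, server_peak_w, derate=0.8):
    """How many servers fit on one circuit if each may draw server_peak_w.

    derate is an assumed safety margin (only 80% of the rated
    current is used for continuous load); adjust it to local rules.
    """
    budget_w = volts * amps * derate   # usable power on the circuit (P = V * I)
    return int(budget_w // server_peak_w)

# Hypothetical 230 V / 16 A circuit
print(servers_per_circuit(230, 16, server_peak_w=450))  # uncapped peak per server
print(servers_per_circuit(230, 16, server_peak_w=350))  # each server capped at 350 W
```

In this example, lowering the guaranteed maximum per server from 450 W to 350 W raises the count from six to eight servers on the same line.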

The current power capping techniques limit the processor P-states: even under heavy utilization, the CPU never reaches its top frequency. This is a rather crude and pretty poor way of keeping the maximum power under control, especially from a performance point of view. The thing to remember here is that higher frequencies always improve processing performance, while extra cores only improve performance in ideal circumstances (no lock contention, enough threads, etc.). Limiting frequency in order to reduce power often leaves a server running far below what it could deliver in terms of performance and power use, just to be "safe".
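The difference between locking out P-states and capping against a power budget can be sketched with a toy model. This is not AMD's actual mechanism: the P-state table, the wattages and the "activity" factor (the fraction of worst-case power that the current workload really draws) are all hypothetical values invented for the illustration.

```python
# (frequency in GHz, worst-case power in watts), fastest first.
# Made-up values, not taken from any real processor.
P_STATES = [(3.0, 115), (2.6, 95), (2.2, 80), (1.8, 65)]

def pstate_cap(cap_w):
    """Static capping: lock out every P-state whose worst-case
    power exceeds the cap, regardless of the actual workload."""
    for freq, worst_w in P_STATES:
        if worst_w <= cap_w:
            return freq
    return P_STATES[-1][0]   # fall back to the slowest P-state

def budget_cap(cap_w, activity):
    """Budget-based capping: allow any P-state whose estimated power
    for the current workload (worst case * activity) fits the cap."""
    for freq, worst_w in P_STATES:
        if worst_w * activity <= cap_w:
            return freq
    return P_STATES[-1][0]

print(pstate_cap(95))                 # top P-state is always locked out
print(budget_cap(95, activity=0.8))   # lighter workload keeps the top P-state
print(budget_cap(95, activity=1.0))   # worst-case workload is throttled
```

In this model a 95 W cap always costs the static scheme its 3.0 GHz state, while the budget-based scheme only gives it up when the workload would actually push power past the limit, which is the behavior the Fruehe quote describes.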


59 Comments


  • ltcommanderdata - Friday, July 15, 2011 - link

    "According to leaked product positioning slides, Zambezi is aimed to fight against Intel's Core i5 and i7 lineups. Zambezi will feature up to eight cores, which is twice as many as i7-2600(K)'s four cores. AMD said that they won't join the Hyper-Threading club and they will deliver as many physical cores as Intel delivers physical and virtual cores combined. It looks like AMD is keeping their word, though they're only delivering half as many "FP/SSE cores". "

    With hyperthreading and now Bulldozer's double integer core/shared FPU design, core counts are becoming an increasingly difficult metric to compare. It's important to note that while Bulldozer has doubled the number of integer cores compared to Istanbul, each integer core is actually weaker since Bulldozer only uses 2 non-symmetric ALUs and 2 AGUs compared to 3 symmetric ALUs and 3 AGUs in Istanbul. Perhaps other architectural efficiencies can make up the difference, but I wouldn't be surprised if clock-for-clock each of Bulldozer's integer cores is slightly slower than Istanbul's. I believe Sandy Bridge's integer performance is clock for clock better than Istanbul, so Bulldozer likely needs very well threaded code for its doubled integer cores to shine.

    FPU resources look to be beefed up from 3 units in Istanbul to 4 units in Bulldozer. Compared to Sandy Bridge, Intel's big advantage is native 256-bit AVX units compared to Bulldozer, which only has 128-bit FP/SSE resources and needs to split 256-bit AVX instructions, halving performance. So if Intel can convince developers to quickly adopt 256-bit AVX, Sandy Bridge should have a pretty large SIMD advantage.
  • duploxxx - Friday, July 15, 2011 - link

    dude, you just sound like a horrified Intel fanboy. "convince developers to adopt 256bit AVX". Then what about FMA3 and FMA4 which intel doesn't even have.....

    A single BD Module can handle a 256bit AVX or can decide to split into 2 x 128 for each core. It is a decision from AMD to go that way just like intel decides to have a 256bit full for a PH + HT core..... 2 x 256 logic would just need more die space without usage, just like the choice to go for 2 ALU/AGU while the usage of 3 is almost no gain in server loads besides benchmarking....

    While the FPU 128+128 might be a bit slower we are talking here about perhaps 2-3% since all other parts like cache and memory are shared for a single module and very negligible difference unless you are a fanboy which is obvious.
  • ltcommanderdata - Friday, July 15, 2011 - link

    "Then what about FMA3 and FMA4 which intel doesn't even have....."

    I believe Bulldozer supports FMA4, but not FMA3 due to Intel flip-flopping on which one they'll support at the last minute breaking commonality. While FMA4 is a great capability to have, you pointing out that Intel doesn't have it is the concern. AVX could see faster adoption because it's supported by both Bulldozer and Sandy Bridge.

    "While the FPU 128+128 might be a bit slower we are talking here about perhaps 2-3% since all other parts like cache and memory are shared for a single module and very negligible difference unless you are a fanboy which is obvious."

    I mention AVX performance, because I'm under the impression that Bulldozer gangs its two 128-bit FMACs together to do 1 AVX per module per cycle while Sandy Bridge has 3x256-bit AVX units per physical core. Sandy Bridge's AVX units are non-symmetric and there are no doubt other factors that will impact performance so it won't be a 3x performance difference, but I'd think it'd be more than 2-3% given the big difference in raw processing resources.
  • duploxxx - Friday, July 15, 2011 - link

    my 2-3% was only the difference between a single 256 vs 2 x 128, not against the intel part... let's see first how much AVX will be really used and how much will end up being 128 bit... doesn't mean something which is 256bit is always better than 128bit.
  • silverblue - Friday, July 15, 2011 - link

    I believe I heard once that Intel's implementation can execute either one 128-bit or one 256-bit instruction per clock. Bulldozer's fused implementation may give up on AVX throughput, but only AVX.
  • rnssr71 - Friday, July 15, 2011 - link

    'It's important to note that while Bulldozer has doubled the number of integer cores compared to Istanbul, each integer core is actually weaker since Bulldozer only uses 2 non-symmetric ALUs and 2 AGUs compared to 3 symmetric ALUs and 3 AGUs in Istanbul.'

    why does everyone get hung up on this? yes, phenom had 3 ALUs and 3 AGUs. big deal! it could only complete 3 instructions per clock- any combination of ALU and AGU instructions but no more than 3. so how often could it process 3 ALUs consecutively?
    AMD has said that removing the 3rd AGU won't hurt performance and core 2, nehalem, and sandy bridge all have 2 AGU's.
    Bulldozer can complete 4 instructions per clock- same as core 2, nehalem and sandy bridge. granted, they all have 3 ALU's available, but how often is the extra one used?
  • SanX - Friday, July 15, 2011 - link

    Got the kids a Phenom II X6 1055T based PC for their games like GTA and just for fun ran some scientific FP-oriented tests on it - parallel algebra codes and some single-core ones.
    Was shocked that at its 2.8GHz stock clock it is twice as fast as my Intel processors overclocked to 4GHz. Is this what you guys get too? Kind of contradicts all these game- and office-oriented benchmarks where Intel is always on top.

    So i'm waiting for these 8-core 32nm chips in the hope to drive them to 4.5 GHz and get an additional factor of 2

    Anyone wants to repeat them ?
  • cosminmcm - Friday, July 15, 2011 - link

    You mean compared to your Intel Pentium 4 @ 4 GHz?
  • GaMEChld - Friday, July 15, 2011 - link

    I too am curious as to what Intel chip was used in that comparison.
  • beginner99 - Friday, July 15, 2011 - link

    Most certainly a dual core with 1/3 of the cores or one of the slowest Core 2 Quads. Sure not a nehalem or sb Quad
