Floating point peak performance of Kaveri and other recent AMD and Intel chips
by Rahul Garg on January 22, 2014 8:30 AM EST
With the launch of Kaveri, some people have been wondering if the platform is suitable for HPC applications. Floating point peak performance of the CPU and GPU on both fp32 and fp64 datatypes is one of the considerations. At launch time, we were not clear on the fp64 performance of Kaveri's GPU but now we have official confirmation from AMD that it is 1/16th the rate of fp32 (similar to most GCN based GPUs except the flagships) and we have verified this on our 7850K by running FlopsCL.
I am taking this opportunity to summarize the info about Kaveri, Trinity, Llano and Intel's competing platforms Haswell and Ivy Bridge on both the CPU and GPU side. We provide a per-cycle estimate for the chips as well as peak calculated in gflops. The estimates are chip-wide, i.e. already take into account the number of cores or modules. Due to turbo boost, it was difficult to decide what frequency to use for peak calculations. For CPUs, we are using the base frequency and for GPUs we are using the boost frequency because in multithreaded and/or heterogeneous scenarios the CPU is less likely to turbo. In any case, we believe our readers are smart enough to calculate peaks at any frequency they want, given that we already supply per-cycle peaks :)
The peak CPU performance will depend on the SIMD ISA that your code was written and compiled for. We consider three cases: SSE, AVX (without FMA) and AVX with FMA (either FMA3 or FMA4).
|Platform||Kaveri||Trinity||Llano||Haswell||Ivy Bridge|
|CPU frequency||3.7 GHz||3.8 GHz||3.0 GHz||3.5 GHz||3.5 GHz|
|SSE fp32 (/cycle)||16||16||32||32||32|
|SSE fp64 (/cycle)||8||8||16||16||16|
|AVX fp32 (/cycle)||16||16||-||64||64|
|AVX fp64 (/cycle)||8||8||-||32||32|
|AVX FMA fp32 (/cycle)||32||32||-||128||-|
|AVX FMA fp64 (/cycle)||16||16||-||64||-|
|SSE fp32 (gflops)||59.2||60.8||96||112||112|
|SSE fp64 (gflops)||29.6||30.4||48||56||56|
|AVX fp32 (gflops)||59.2||60.8||-||224||224|
|AVX fp64 (gflops)||29.6||30.4||-||112||112|
|AVX FMA fp32 (gflops)||118.4||121.6||-||448||-|
|AVX FMA fp64 (gflops)||59.2||60.8||-||224||-|
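The gflops rows are the per-cycle rows multiplied by the clock in GHz, so a peak at any other frequency is easy to recompute. Here is a minimal Python sketch using the AVX FMA fp32 per-cycle values and base clocks from the table. The function name `peak_gflops` is illustrative, and the mapping of columns to platforms is assumed to follow the order used in the GPU table (Kaveri, Trinity, Llano, Haswell, Ivy Bridge).

```python
# Chip-wide per-cycle peaks (flops/cycle) and base clocks (GHz) copied from the
# CPU table. The column-to-platform mapping is an assumption based on the
# platform order used elsewhere in the article.
BASE_GHZ = {"Kaveri": 3.7, "Trinity": 3.8, "Haswell": 3.5}
FMA_FP32_PER_CYCLE = {"Kaveri": 32, "Trinity": 32, "Haswell": 128}


def peak_gflops(flops_per_cycle, ghz):
    """Peak gflops = flops per cycle * billions of cycles per second."""
    return flops_per_cycle * ghz


for name, ghz in BASE_GHZ.items():
    # Rounded to one decimal to match the precision used in the table.
    print(name, round(peak_gflops(FMA_FP32_PER_CYCLE[name], ghz), 1))
```

To estimate a peak at a turbo clock instead, pass that clock as `ghz`; the per-cycle figure does not change.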
It is no secret that AMD's Bulldozer family cores (Steamroller in Kaveri and Piledriver in Trinity) are no match for recent Intel cores in FP performance due to the shared FP unit in each module. As a comparison point, one core in Haswell has the same floating point performance per cycle as two modules (or four cores) in Steamroller.
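The per-cycle comparison can be reproduced from SIMD width and pipe counts. The sketch below is only an approximation under stated assumptions that are not given in the article: two 256-bit FMA pipes per Haswell core, and two 128-bit FMAC pipes shared by each Steamroller module. The function and parameter names are hypothetical.

```python
def flops_per_cycle(units, simd_bits, pipes_per_unit, flops_per_lane, elem_bits=32):
    """Chip-wide flops/cycle = units * pipes * SIMD lanes * flops per lane."""
    lanes = simd_bits // elem_bits
    return units * pipes_per_unit * lanes * flops_per_lane


# Assumed layouts (not stated in the article):
#   Haswell: two 256-bit FMA pipes per core, an FMA counts as 2 flops per lane.
#   Steamroller: two 128-bit FMAC pipes shared by the two cores of a module.
haswell_one_core = flops_per_cycle(units=1, simd_bits=256, pipes_per_unit=2, flops_per_lane=2)
kaveri_two_modules = flops_per_cycle(units=2, simd_bits=128, pipes_per_unit=2, flops_per_lane=2)
haswell_four_cores = flops_per_cycle(units=4, simd_bits=256, pipes_per_unit=2, flops_per_lane=2)

print(haswell_one_core, kaveri_two_modules, haswell_four_cores)  # 32 32 128
```

Under these assumptions one Haswell core and two Steamroller modules both come out at 32 fp32 flops per cycle, which agrees with the AVX FMA row of the table. Passing `elem_bits=64` halves the lane count and gives the fp64 figures.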
Now onto GPU peaks. Here, for Haswell, we chose to include both GT2 and GT3e variants.
|Platform||Kaveri||Trinity||Llano||Haswell GT3e||Haswell GT2||Ivy Bridge|
|GPU frequency||720 MHz||800 MHz||600 MHz||1.3 GHz||1.25 GHz||1.15 GHz|
fp64 gflops (OpenCL)
fp64 gflops (Direct3D)
The fp64 support situation is a bit of a mess because some GPUs only support fp64 under some APIs. The fp64 rate of Intel's GPUs does not appear to be published, but David Kanter provides an estimate of 1/4 the fp32 rate. However, Intel enables fp64 only under DirectCompute and not under OpenCL on any of its GPUs.
The situation on AMD's Trinity/Richland is even more complicated. fp64 support under OpenCL is not standards-compliant and depends on a proprietary extension (cl_amd_fp64). From what I can tell, Trinity/Richland do not appear to support fp64 under DirectCompute (or under Microsoft's C++ AMP implementation). From an API standpoint, Kaveri's GCN GPU should work fine for fp64 under all APIs.
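In practice, an OpenCL application can tell these cases apart by inspecting the device's extension string, which the host API returns from `clGetDeviceInfo` with `CL_DEVICE_EXTENSIONS`. The following is a minimal sketch of that check in plain Python. The function name `fp64_support` and the sample extension strings are hypothetical, and `cl_khr_fp64` is the standard Khronos extension name, which the article itself does not mention.

```python
def fp64_support(extensions):
    """Classify OpenCL fp64 support from a space-separated extension string."""
    names = set(extensions.split())
    if "cl_khr_fp64" in names:
        return "standard"          # Khronos double-precision extension
    if "cl_amd_fp64" in names:
        return "amd-proprietary"   # AMD-only extension named in the article
    return "none"


# Hypothetical extension strings, for illustration only.
print(fp64_support("cl_khr_icd cl_khr_fp64 cl_amd_fp64"))  # standard
print(fp64_support("cl_khr_icd cl_amd_fp64"))              # amd-proprietary
print(fp64_support("cl_khr_icd cl_khr_byte_addressable_store"))  # none
```

Checking the standard extension first means a device that advertises both is treated as standards-compliant.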
Some of you might be wondering whether Kaveri is good for HPC applications. Applications that have already been ported to discrete GPUs and work well there will continue to run best on discrete GPUs. However, Kaveri and HSA will enable many more applications to be GPU accelerated.
Now we compare Kaveri against Haswell. In applications depending upon fp64 performance, conditions are not generally favorable to Kaveri. Kaveri's fp64 peak including both the CPU and GPU is only about 110 gflops. You will generally be better off first optimizing your code for AVX and FMA instructions and running on Haswell's CPU cores. If you are using Windows 8, you might also want to explore using Iris Pro through C++ AMP in conjunction with the CPU. Overall I doubt we will see Kaveri being used for fp64 workloads.
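As a rough cross-check of the combined Kaveri fp64 figure, the CPU and GPU peaks can be added up. This sketch assumes 512 GCN shaders on the A10-7850K, each retiring one FMA (2 flops) per cycle; the shader count is not given in the article. The GPU clock, the 1/16 fp64 rate, and the 59.2 gflops CPU peak are taken from the text and tables. The function name is hypothetical.

```python
def kaveri_fp64_peak(shaders=512, gpu_ghz=0.72, fp64_rate=1 / 16, cpu_fp64_gflops=59.2):
    """Return (GPU fp32, GPU fp64, CPU + GPU fp64) peaks in gflops.

    shaders=512 is an assumption (not stated in the article); the other
    defaults come from the article's text and tables.
    """
    gpu_fp32 = shaders * 2 * gpu_ghz      # one FMA per shader per cycle = 2 flops
    gpu_fp64 = gpu_fp32 * fp64_rate       # fp64 runs at 1/16 of the fp32 rate
    return gpu_fp32, gpu_fp64, cpu_fp64_gflops + gpu_fp64


gpu_fp32, gpu_fp64, total_fp64 = kaveri_fp64_peak()
print(round(gpu_fp32, 2), round(gpu_fp64, 2), round(total_fp64, 2))
```

Under these assumptions the combined fp64 peak lands a little above 100 gflops, in the same range as the figure quoted above and well below the 224 gflops that Haswell's CPU cores alone reach with AVX and FMA.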
For heterogeneous fp32 applications, Kaveri should outperform Haswell GT2 and Ivy Bridge. Haswell GT3e will again be a strong contender on Windows given the extremely capable Haswell CPU cores and Iris Pro graphics. Intel's GPUs do not currently support OpenCL under Linux, but a driver is being worked on. Thus, on Linux, Kaveri will simply win out on fp32 heterogeneous applications. However, even on Windows, Haswell GT3e will get strong competition from Kaveri. While AMD has advantages such as the excellent GCN architecture and the HSA software stack (when ready), which will enable many more applications to take advantage of the GPU, Iris Pro will have the eDRAM to potentially provide much improved bandwidth and the backing of strong CPU cores.
I hope I have provided a fair overview of the FP capabilities of each platform. Application performance will of course depend on many more factors. Your questions and comments are welcome.
tipoo - Wednesday, January 22, 2014
I sure hope it's more common with Broadwell. GT3e is a decent performer, I do wish it would make its way to 13" laptops.
Klimax - Wednesday, January 22, 2014
Bit older and IIRC unconfirmed:
lefty2 - Wednesday, January 22, 2014
Indeed. The 4770R is only available to OEMs and more or less unobtainable. Even if you could get your hands on one, you wouldn't want to. Firstly, it comes with a huge price tag; secondly, you lose 2M of cache, which effectively makes it a Core i5.
SunLord - Wednesday, January 22, 2014
It's OEM only because it's totally worthless any other way as it's a BGA only part :(
Shadowmaster625 - Wednesday, January 22, 2014
So that huge GPU in Kaveri can't even outperform an Ivy Bridge, let alone a Haswell, in terms of fp64/cycle. Why/how is AMD still in business?
jabber - Wednesday, January 22, 2014
Oh I can imagine that's always the first question asked when anyone walks into a Best Buy etc. to buy a new PC. I know it keeps me awake at night. Meanwhile back in the real world...
nathanddrews - Wednesday, January 22, 2014
The key to AMD's success with Kaveri will come on budget mobile notebooks and SFF, where the lack of a dGPU would heavily tilt the gaming advantage to AMD. While Intel HD4000/4600 can game pretty well at 768p, Kaveri would steamroll it and be competent up to 900p... assuming Broadwell IGP doesn't greatly improve.
YuLeven - Wednesday, January 22, 2014
I'm not so sure just yet. I'm hoping for a strong Kaveri on laptops, but past experiences with Llano, Trinity and Richland showed the clear desktop win of AMD's APUs quickly eroding on portables due to power constraints.
This year, the gap is much smaller, with strong contenders such as HD 5000 and HD 5100 in many laptops. I'm not entirely sure Kaveri's upper hand in graphics performance will be large enough to justify the considerable loss of CPU performance and battery life (assuming that Kaveri will perform as poorly against Haswell as its older brothers did). And then, Kaveri mobile will come just months before Broadwell, which is said to improve GPUs by quite a bit.
PEJUman - Wednesday, January 22, 2014
I have an A6-1450 11.6" laptop that is supposed to be a 9W 'SoC' with a 30Wh battery.
I also have an i3 Ivy Bridge 11.6" tablet that is supposed to be a 17/14W 'SoC' with a 54Wh battery.
Expected the A6 to get 60-70% of the i3's battery life based on a light-load usage pattern. Got only about 40% of the i3's life.
Rough calculation ended up with around 10W average power consumption on the A6
and around 5W average power consumption on the i3.
Considering I only paid $280 for the A6 and $450 for the i3, I am still quite happy with it,
but I can't help but wonder if AMD's SDP/TDP is very different compared to Intel's.
To my understanding TDP means the max amount of heat you need to dissipate to keep everything running smoothly. Based on that understanding and the 15W power range, you can let the CPU/APU run hotter (thus rejecting 2-3W worth of heat into surrounding pieces: package, motherboard, case, etc.) with the same heatsink TDP.
Death666Angel - Wednesday, January 22, 2014
Can't really draw any conclusions from that. The SoC/APU/CPU usually accounts for a very small share of the energy draw in modern laptops/tablets. The display accounts for most of the power usage, and even a small difference in brightness, or indeed in manufacturer, can easily account for your scenario.