Final Words
The more I think about it, the more confident I am that the Core i7 keeps Intel firmly at the front of the performance race, although admittedly the biggest gains come in well-threaded workloads (I will be working on a Hyper Threading/multi-tasking set of tests next). It's not worth the upgrade for most existing Core 2 Quad owners unless you do a lot of video encoding, video editing or 3D rendering, but going forward it looks very likely to extend Intel's performance lead even as AMD brings up its 45nm Phenom processors.
Take power efficiency into account, however, and Nehalem gets interesting to a much wider audience. Right now we're only talking about 130W TDP parts, which means the power efficiency argument really only applies to someone looking to replace a QX9770. Going forward, once Intel can deliver 95W, 65W or even lower TDP parts based on Nehalem, there may be a compelling power efficiency story. A 10 - 20% decrease in power consumption, on the same manufacturing process, is nothing to scoff at. Then a year from now we get the same architecture built on 32nm, which should hopefully reduce power consumption even further. It's strange to say, but Nehalem may end up being an incredibly good architecture for notebooks. Keep that in mind before buying those new MacBooks, guys.
The power efficiency story gets even more exciting when you realize that these gains come with no change in manufacturing process. Pardon the pun, but the next tick is going to be a cool one.
The overclocking story with Core i7 isn't as complex as it first sounded; fundamentally, you can still clock this thing the way you did the Core 2s before it. Turbo mode and the TDP/current limits do add some complexity, but with the flip of a BIOS switch they go away if you don't wish to bother with them. Change can be scary, but in this case there's no reason to worry.
The Core i7 appears to be just as smooth an overclocker as the Core 2s before it. Increase the BCLK and off you go: free performance from Intel and its wonderful fabs.
The split between the core and the uncore in terms of clock speed and overclocking potential doesn't appear to be a big deal either. The uncore runs slower on the lower end chips, but increasing its clock speed doesn't do all that much for performance. There's a reason Intel kept the uncore running slower than the core, and there doesn't appear to be much real-world benefit in pushing it much higher.
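For the curious, here's a minimal sketch of how those clock domains derive from BCLK. The 133MHz BCLK and the i7-920's 20x core multiplier match the stock configuration discussed here; the 16x uncore and 8x memory multipliers are typical values and should be treated as assumptions.

```python
# Minimal sketch of how Core i7 clocks derive from the base clock (BCLK).
# Core multiplier is per-model; uncore and memory multipliers shown here
# are typical values, not guaranteed for every chip.

def clocks_from_bclk(bclk_mhz, core_mult, uncore_mult, mem_mult):
    """Return (core, uncore, memory data rate) in MHz for a given BCLK."""
    return (bclk_mhz * core_mult,    # core clock
            bclk_mhz * uncore_mult,  # uncore: L3 cache + memory controller
            bclk_mhz * mem_mult)     # DDR3 effective data rate

print(clocks_from_bclk(133, 20, 16, 8))  # stock i7-920: (2660, 2128, 1064)
print(clocks_from_bclk(166, 20, 16, 8))  # raised BCLK: every domain scales together
```

This is why BCLK overclocking feels familiar: as with FSB overclocking on the Core 2, one knob moves every domain at once.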
With Nehalem, Intel implemented a lot of changes simultaneously: Hyper Threading, a fully static CMOS design, new power gate transistors, QPI, an integrated memory controller and a number of lower level architectural tweaks. It's a lot to digest, but we're getting there. To Intel: deliver us some 95W and 65W TDP Nehalem parts and you'll win the hearts of today's Q6600/Q9300/Q9450 owners.
And I can't wait to see one of these things in a notebook; mobile Nehalem could be the most exciting Centrino launch since Merom...
Comments
Denithor - Saturday, November 8, 2008 - link
HT works well on i7 because of two things: software is much more multithreaded today, and there have been drastic throughput and memory controller improvements in the generations from Netbust to Nehalem. Multithreaded applications can be accelerated hugely by pulling resources from multiple cores to work on one application (whether the cores are physical or virtual doesn't matter).
HT on Netbust was like fitting a garden hose onto a fire hydrant. The data just backed up and couldn't feed through the pipe smoothly. On i7 the bandwidth and memory controller have been optimized to improve flow so the cores don't sit idle (HT basically levels the flow of work across the cores so they all stay busy).
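To put rough numbers behind that analogy, here's a quick back-of-the-envelope sketch; these are nominal peak bandwidth figures for an 800MT/s front-side bus versus triple-channel DDR3-1066, not measurements.

```python
# Back-of-the-envelope numbers behind the fire hydrant analogy.
# Nominal peak figures only; real sustained bandwidth is lower.

fsb_peak = 800e6 * 8          # Netburst-era FSB: 800 MT/s x 8 bytes/transfer
imc_peak = 3 * 1066e6 * 8     # Nehalem IMC: 3 channels x 1066 MT/s x 8 bytes

print(f"FSB peak: {fsb_peak / 1e9:.1f} GB/s")   # ~6.4 GB/s
print(f"IMC peak: {imc_peak / 1e9:.1f} GB/s")   # ~25.6 GB/s
print(f"Ratio:    {imc_peak / fsb_peak:.1f}x")  # ~4x more to keep the cores fed
```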
TA152H - Saturday, November 8, 2008 - link
Actually, you're probably missing the point that Nehalem is a lot wider than the Pentium 4 was. Consequently, in any given clock cycle you have more execution resources available than one thread is likely to use, and an additional thread can put them to work.

Most of the time, the data is read from the L1 cache or, at worst, the L2 cache, so memory throughput isn't going to be a huge problem most of the time. Then again, the i7's L1 cache is bigger than the Pentium 4's, which probably helps as well. It's very slow though, and it makes you wonder why they shackled this processor with such a slow L1 cache (the same latency in cycles as the Pentium 4, but in a design with a much lower target clock speed). I mean, it can't clock higher than Penryn, and the cache isn't any bigger than Penryn's, so does it need to be 33% slower? Power savings are nice, but not for a 33% slower L1 cache.
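As a toy illustration of the width argument: the 4-wide issue width is the commonly cited figure for this class of core, and the per-thread IPC below is a purely illustrative assumption; real SMT gains depend on contention for caches and execution units, which this ignores.

```python
# Toy model of the "wider core" argument: a second SMT thread can fill
# issue slots the first thread leaves idle. Ignores resource contention.

CORE_WIDTH = 4  # instructions the core can issue per cycle (assumed)

def throughput(per_thread_ipc: float, threads: int) -> float:
    """Total issue slots used per cycle, naively capped at core width."""
    return min(per_thread_ipc * threads, CORE_WIDTH)

one_thread  = throughput(1.5, 1)  # stalls leave ~2.5 slots idle each cycle
two_threads = throughput(1.5, 2)  # the second thread soaks up idle slots

print(f"1 thread:  {one_thread:.1f}/cycle ({one_thread / CORE_WIDTH:.0%} of peak)")
print(f"2 threads: {two_threads:.1f}/cycle ({two_threads / CORE_WIDTH:.0%} of peak)")
```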
Also, I'm curious why Intel gave up on the Pentium 4 before moving it to 45nm production. If you think about it, the drastically lower power of that manufacturing technology would have yielded enormous improvements in clock speed (since the limitation wasn't transistor switching speed but power/heat). I don't think there's any doubt it would be running over 6GHz, and with some effective tweaks (and undoing some of Prescott's damage) it might have been an interesting processor. Probably not, though, but I'm a little curious how it would have panned out.
ltcommanderdata - Saturday, November 8, 2008 - link
Yes, I think HT fits well with Nehalem because of the increased execution resources: 3 ALUs, 2 FPUs, and 3 SSE units compared to 3 ALUs and 2 FPU/SSE units in Netburst. Although I think HT serves a different purpose in each design: Netburst didn't have as much memory bandwidth and its latency was higher, so HT served to hide that, while Nehalem has plenty of memory bandwidth and execution resources, and HT serves to take best advantage of them.

In regards to the high cache latency, I have to agree. I have yet to see an explanation of where the high L1 cache latency comes from, and the L2 latency is similarly unimpressive considering Dothan had a 2MB L2 cache per core with a 10 cycle latency while Nehalem's 256KB L2 cache per core has a higher latency at 11 cycles. Granted, having an L3 cache perhaps forces limitations on the other caches, but I still think the latencies are quite high. No offense to the Oregon team, but the last time they did a microarchitecture refresh, in Prescott, they increased the P4's L1 latency from 2 cycles in Northwood to 4 cycles and the L2 latency from 16 cycles to 23. So it's disconcerting that they've increased the L1 latency from 3 cycles in Penryn to the same 4 cycles in Nehalem, shrunk the L2 from 6MB to 256KB only to gain 4 cycles (down to 11), and added a 39 cycle L3. I don't think latencies will improve in Westmere, but hopefully they can double the L2 to 512KB without increasing latency and similarly grow the L3, probably to 12MB, without increasing latency. And maybe latencies can improve in the next microarchitecture refresh, Sandy Bridge, with the return of the Israeli team.
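To see what those latencies mean in practice, here's a rough average-memory-access-time sketch; only the 4/11/39 cycle figures come from the discussion above, while the hit rates and the ~200 cycle memory latency are assumptions for illustration.

```python
# Rough average memory access time (AMAT) model using the latencies quoted
# above: L1 = 4 cycles, L2 = 11, L3 = 39. Hit rates and memory latency are
# illustrative assumptions, not measured values.

def amat(l1, l2, l3, mem, l1_hit=0.95, l2_hit=0.80, l3_hit=0.90):
    """AMAT in cycles: each deeper level's latency is weighted by the odds of reaching it."""
    return l1 + (1 - l1_hit) * (l2 + (1 - l2_hit) * (l3 + (1 - l3_hit) * mem))

print(f"4-cycle L1 (Nehalem-like): {amat(4, 11, 39, 200):.2f} cycles")
print(f"3-cycle L1 (Penryn-like):  {amat(3, 11, 39, 200):.2f} cycles")
# The extra L1 cycle adds a full cycle to every single access, comparable
# to the entire miss-penalty contribution in this toy model.
```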
And I also agree that the P4 could probably still have had hope on the 45nm process. Even at 65nm, Presler still had potential. With the Pentium Extreme Edition 965, Intel had basically caught up with the power consumption of its competitor, the FX-60. And things actually improved over time: with the original Presler B1 stepping Intel was only able to reach 3GHz in the Pentium D 930 at a 95W TDP, while by the last D0 stepping, released after Conroe, Presler was able to reach 3.6GHz in the Pentium D 960 under the same 95W TDP. On the same process, a 20% increase in clock speed at the same power consumption is impressive for any microarchitecture, and especially for Netburst.
Clearly, the 65nm process could have brought Netburst's power consumption under control, but by that time development focus had long since shifted to Merom, which is why Presler/Cedar Mill was only a shrink rather than a redesign of Prescott. I guess we'll never know what could have happened if Intel had actually used Presler to correct Prescott's flaws: reducing cache latency, adding a second instruction decoder to keep the Trace Cache and execution units fed, introducing a native dual core design like Yonah over Dothan, etc. But I think the Merom strategy was better in the end, since even with a redesign to improve performance, Netburst's power consumption would probably always have been at the high end of acceptable, and it would never have been fit for mobile usage, which is where consumer focus is shifting.
IntelUser2000 - Saturday, November 8, 2008 - link
Don't complain about the lack of a single thread increase. Where do you think the majority of the performance increase in Core 2 came from? It's not a new idea, it just has better memory parallelism (memory disambiguation, excellent prefetchers). The future IS multi-threaded; single thread work brings minimal performance increases. For gamers who care, the GPU does far more than the CPU, and multi-threading improves the things that really matter.
Westmere isn't going to bring large L2 caches. L3 caches will increase, but that's because the core count is going to 6 cores. Sandy Bridge will bring the per core L2 cache to 512KB, but how much do you think that'll do? At most 5-10%.
The number of ways to increase x86 CPU performance is shrinking. This is the reason Sandy Bridge will bring a more advanced Turbo Mode implementation for single threaded performance.
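The classic way to quantify this trade-off is Amdahl's law; a quick sketch follows, where the parallel fractions are illustrative, not benchmark data.

```python
# Amdahl's law: speedup = 1 / ((1 - p) + p / n) for parallel fraction p on
# n hardware threads. The fractions below are illustrative assumptions.

def amdahl_speedup(p: float, n: int) -> float:
    """Upper bound on speedup when a fraction p of the work parallelizes across n threads."""
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.50, 0.90, 0.99):
    print(f"p = {p:.0%}: 4 threads -> {amdahl_speedup(p, 4):.2f}x, "
          f"8 threads -> {amdahl_speedup(p, 8):.2f}x")
# Mostly serial work (p = 50%) barely benefits from more threads, which is
# exactly why Turbo-style single thread boosts still matter.
```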
ltcommanderdata - Sunday, November 9, 2008 - link
I wasn't aware that I was complaining about single-threaded performance in my previous posts. And another important thing that Sandy Bridge is bringing is AVX. SIMD doesn't benefit all programs, but it does increase the performance of optimized applications regardless of whether they are single-threaded or multi-threaded.
SiXiam - Saturday, November 8, 2008 - link
"The Q9450 can operate at voltages down to 0.85V and as high as 1.3625V, while the Core i7-920 currently appears to be limited to a minimum of around 1.137V."- I just wanted to let everyone know that benchmarkreviews.com got the i7 920 at stock speeds with 1.125volts.
2.66GHz @ 1.125V (133MHz x 20)
http://benchmarkreviews.com/index.php?option=com_c...
Denithor - Friday, November 7, 2008 - link
Great article. Very impressive results here; congrats to the i7 design team. Of course, we all said the same thing when C2D was launched, with a much bigger differential in performance/watt versus the "Netbust" architecture.

Have you guys tried the F@H SMP client on these i7 chips yet? I'm curious how they stack up against the Q9xx0 series in raw performance. Do the multithreading improvements help put CPU folding any closer to GPU folding, or will the GPU continue to reign supreme?
Does Intel intend to launch dual-core versions of these processors or will this generation be quad only?
Finally, for myself, I have an E8400 and an E3110, which are more than adequate for my current needs. I doubt I'll even bother with one of these new setups; I'll just wait until Westmere and the 32nm improvements (higher clocks, lower power and heat, and probably lower prices).
Strid - Friday, November 7, 2008 - link
Yeah, I agree. While they offer solid quad-core performance, and possibly decent energy efficiency too, they're not much use for a guy like me who doesn't use much of that multi-core jazz. They might not chew up more watts than the QX9770, but the QX9770 is still a lot hungrier than even the currently quickest 45nm dual core (E8600). Any news on a dual-core version of Nehalem yet? I'll stick to my Xeon E3110 until then.
tynopik - Friday, November 7, 2008 - link
> (I will be working on a Hyper Threading/multi-tasking set of tests next).looking forward to it!
(and then the VM tests ;)
cpugeek - Friday, November 7, 2008 - link
I think AnandTech failed to mention QPI vs. the FSB. QPI is super power hungry and offsets a lot of the power reduction Intel achieved. That's why Lynnfield/Clarksfield will be much more power efficient: they won't use a QPI physical layer to talk to the chipset/Tylersburg.