The Role of Intelligent Design in the Evolution of Multi-Core Processors
February 28, 2006
Multi-core is the new megahertz. Suppliers whose technology roadmaps a few years ago led inexorably to "10GHz by the end of the decade" now boast of plans to incorporate "ten to hundreds of execution engines" within a decade. Because multi-core designs usually outperform single-core processors based on similar microarchitectures, they require more memory and I/O bandwidth than their single-core counterparts; the old laws of "balanced system design" still apply. Systems that lack the necessary bandwidth will inevitably deliver sub-optimal results. System architecture still matters.
Earlier this month, Intel demonstrated an early version of Clovertown, a quad-core processor it hopes will allow it to "Leap Ahead" of AMD in the server performance battle that began in earnest in 2003. Although the company coyly refuses to indicate what's inside the package, industry rumors suggest Clovertown includes a pair of Woodcrests, Intel's Next Generation Microarchitecture (iNGMa) server processor. Some question whether an ad hoc multi-chip approach can legitimately claim the "quad-core" label, a debate Insight 64 chooses to avoid. On the one hand, it's clear there are four cores residing in each CPU socket. It hardly matters whether these four cores reside on a single (large) piece of silicon or on two smaller dice in a multi-chip package. On the other hand, the two processors behave like two dual-core processors in a two-socket Woodcrest platform, although their performance will be a tad slower due to constraints on the speed of the front-side bus that links all four cores to the rest of the system. In the remainder of this note, Insight 64 compares and contrasts the architecture of Intel's ad hoc multi-core processors that (from our perspective) were thrown together to back up marketing collateral, with Intel's own more advanced designs that display the influence intelligent designers applied in pursuit of superior system performance. For the sake of completeness, we'll also look at AMD's approach to dual- and quad-core technology.
Intel's First Dual-core Designs – Short Time-to-Market but Weak Performance
Intel's initial entries in the dual-core arena – Smithfield, Paxville, Dempsey and Presler – all take a pragmatic, ad hoc approach to dual-core integration. These chips' packages contain two independent processors, sometimes located on a single large die, and other times on two smaller dice. In all these designs, neither core realizes the other lies in close physical proximity. Communication between the cores must be accomplished over the external front-side bus that connects both cores to the north bridge. Figure 1 illustrates the most highly developed of these approaches – Dempsey with the Blackford chipset – that provides a separate front-side bus for each CPU socket in the system. This simplistic approach to dual-core architecture greatly shortened Intel's development schedules, and allowed the company to bring its initial dual-core products to market in just nine months.
Intel Dual-Core (Dempsey) with Blackford Chipset
Although straightforward from a design standpoint, Intel's initial dual-core processors suffer from two key deficiencies. First, the front-side bus, already a performance inhibitor in single-core systems, now must be shared by both cores. Second, inter-processor cache snooping, a feature that ensures the coherency of cached data, must be performed over the front-side bus, just as it is in the case of discrete dual-processor systems. Thus cache snooping places an incremental load on the FSB, which in turn constrains the performance of cache snooping and degrades memory access latency. Intel attempts to mitigate the issues stemming from its use of a front-side bus by increasing the cache size of its processors, an approach that sometimes helps and sometimes hinders performance.
Yonah Gets It Right
While Intel's engineers in Santa Clara and Oregon floundered with their ad hoc dual-core approaches, the engineers in Intel's Israeli Design Center (IDC) had the time and the budget to pursue a more refined and effective approach. Yonah, their initial dual-core chip as illustrated in Figure 2, duplicates most elements of the Dothan processor from which it was derived, but shares caches and system interface functions across both cores. Since Yonah's two cores share a common L2 cache, they never need to go off-chip to ensure cache coherency. Better still, when both cores execute threads of a single application, the data either core pulls into the cache can be utilized without additional effort by the other. Yonah eliminates most of the performance-inhibiting front-side bus contention issues common to Smithfield, Paxville, Presler and Dempsey, since the two cores share a front-side bus interface that serves both cores' requirements. This also simplifies bus loading characteristics from a circuit design perspective and facilitates higher speed FSB operation. Although Insight 64 regards front-side bus interfaces as an archaic design concept, Yonah's FSB interface certainly makes the best of a bad situation.
Intel's Dual-Core (Yonah) Mobile Processor
Although it's a bit early to discuss Intel's Next Generation Microarchitecture (iNGMa) processors (Merom and Conroe), it's safe to assume these chips will inherit Yonah's approach to dual-core architecture and will feature even more inter-core sharing than the earlier Yonah processor. Intel designed Woodcrest, its iNGMa processor for two-way servers, so that OEMs can "drop" it into platforms originally designed for Dempsey, its ad hoc dual-core CPU, as shown in Figure 3. Given Blackford's dual-independent front-side buses, and Woodcrest's shared bus interface, Intel has room to increase the FSB speed over the corresponding Dempsey-based platforms. This is about as good as it gets in Intel's FSB-based dual- and multi-processor server architectures.
Intel Dual-Core (Woodcrest) Server Processor with Blackford Chipset
Enter Clovertown from Stage Left
Just when we thought Intel had finally kicked its ad hoc, quick and dirty approach to multi-core processor design, the company began to beat the drums for Clovertown, its "next generation" quad-core server processor. Intel hasn't said whether it architected Clovertown as a quad-core processor, or merely combined a pair of dual-core chips in a multi-chip package, a la its Dempsey (et al) dual-core designs. At a similar point in its dual-core disclosures a few years back, Intel refused to comment on the internal design and architecture of those chips, which Insight 64 then surmised (and events later confirmed) took an ad hoc approach. Based on the timing of the recent drumbeats, we believe that Clovertown will consist of two Woodcrest dice crammed into a single package, as illustrated in Figure 4. We've seen this movie before, and it didn't have a happy ending then, either. Quad-core processors demand more memory bandwidth than dual-core designs, but the ad hoc design of Clovertown asks an FSB slower than the one that feeds the dual-core Woodcrest to feed four cores instead of two. The label (and the press release) will say "quad-core," but the performance will likely be only marginally better than a dual-core Woodcrest in most applications. Clovertown will resurrect all the problems (cache snooping over the front-side bus, contention for FSB access, an overworked memory controller) common to Intel's earlier ad hoc dual-core designs. As they meander through Intel's fabs, in-process Woodcrest dice will worry that they may end up starving for bandwidth in a Clovertown package, while their more fortunate brethren will go into simpler dual-core packages and receive a far richer diet of data. Although we doubt there will be an outcry to end this inhumane treatment of processor chips, we're reasonably sure these ad hoc quad-core devices won't be all that they can be.
Intel Quad-Core (Clovertown) Server Processor with Blackford Chipset
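The bandwidth argument above can be sketched with some back-of-the-envelope arithmetic. The short Python example below divides the peak bandwidth of a shared bus evenly among the cores behind it. The 64-bit bus width and the two transfer rates are illustrative assumptions, not figures from this note or confirmed Clovertown specifications, and the function names are hypothetical. The model ignores snoop traffic, arbitration overhead and caching, so it describes only an upper bound on each core's share.

```python
# Illustrative model only. The bus width and transfer rates used below are
# assumptions chosen for the example, not confirmed product specifications.


def peak_bandwidth_gb_s(bus_width_bytes: int, mega_transfers_per_s: float) -> float:
    """Peak bus bandwidth in GB/s (10^9 bytes per second)."""
    return bus_width_bytes * mega_transfers_per_s / 1000.0


def per_core_share_gb_s(
    bus_width_bytes: int, mega_transfers_per_s: float, cores: int
) -> float:
    """Even split of peak bus bandwidth across the cores sharing the bus."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return peak_bandwidth_gb_s(bus_width_bytes, mega_transfers_per_s) / cores


if __name__ == "__main__":
    BUS_WIDTH_BYTES = 8  # assumed 64-bit data bus

    # Hypothetical dual-core package on a faster bus.
    dual = per_core_share_gb_s(BUS_WIDTH_BYTES, 1333.0, cores=2)
    # Hypothetical quad-core package on a slower bus.
    quad = per_core_share_gb_s(BUS_WIDTH_BYTES, 1066.0, cores=4)

    print(f"dual-core package: {dual:.2f} GB/s per core")
    print(f"quad-core package: {quad:.2f} GB/s per core")
```

Under these assumed numbers, each core in the quad-core package gets well under half the bus bandwidth available to a core in the dual-core package, which is the qualitative point of the paragraph above. Actual results would depend on real bus speeds, cache hit rates and workload.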
Opteron Got It Right the First Time
Unlike Intel's multi-year dalliance with hyper-pipelining and ultra high clock frequencies as the path to increased processor performance, AMD's CPU architects identified the potential for multi-core processors as they designed their first generation Opteron and Athlon64 processors. Those initial single-core chips, introduced in 2003, contained features that simplified AMD's move to dual-core in 2005, and allowed both cores to share the on-board memory controller and HyperTransport links, as shown in Figure 5. AMD's system partners accomplished the move from single to dual-core systems by merely "dropping" dual-core Opterons into the sockets previously designated for single-core processors, a far simpler move than the gymnastics Intel's system OEMs had to undertake in their move to dual-core Xeon systems.
AMD Dual-Core Opteron, circa 2005
Insight 64 does not anticipate the release of AMD quad-core processors prior to 2007, but the company has already laid the groundwork for that offering. This year, for the first time since the 2003 Opteron launch, AMD plans to revise the socket design used for its server processors. The new so-called "Socket F" used in the motherboards OEMs roll out this year for "F Step" dual-core Opterons also contains the features AMD needs to support the quad-core Opterons included in its 2007 roadmap, as shown in Figure 6. Just as in the dual-core designs, all the cores in AMD's architected quad-core chip will share access to the memory controller and HT links. Cache snooping operations that involve only the cores on a single chip can be handled on chip, without any off-chip traffic. Since cache snooping traffic grows much faster than linearly with the number of cores in a system, this aspect of AMD's design will be even more important in its quad-core CPUs than in the current dual-core models.
AMD Quad-Core Opteron, circa 2007
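The scaling argument for snoop traffic can be illustrated with a simple counting model. The sketch below assumes a plain broadcast scheme in which every cache miss probes every other core's cache, and it splits those probes into ones that stay on chip and ones that must leave the chip. This is a hypothetical simplification for illustration: the function name and parameters are invented here, and the model does not represent the actual coherency protocol of any AMD or Intel product, which may use shared caches, snoop filters or per-node probes.

```python
# Hypothetical counting model of broadcast snooping. It is a simplification
# for illustration and does not describe any vendor's actual protocol.


def snoop_probes(
    total_cores: int, cores_per_chip: int, misses_per_core: int = 1
) -> tuple[int, int]:
    """Return (on_chip, off_chip) probe counts.

    Assumes each miss probes every other core in the system, and that
    probes to cores on the same chip never leave that chip.
    """
    if cores_per_chip < 1 or total_cores < cores_per_chip:
        raise ValueError("need 1 <= cores_per_chip <= total_cores")
    if total_cores % cores_per_chip != 0:
        raise ValueError("total_cores must be a multiple of cores_per_chip")

    misses = total_cores * misses_per_core
    on_chip = misses * (cores_per_chip - 1)
    off_chip = misses * (total_cores - cores_per_chip)
    return on_chip, off_chip


if __name__ == "__main__":
    # Two sockets in each case; cores_per_chip=1 models a design in which
    # no probe can be resolved on chip.
    for label, total, per_chip in [
        ("dual-core chips, on-chip snooping", 4, 2),
        ("quad-core chips, on-chip snooping", 8, 4),
        ("eight cores, all probes off chip", 8, 1),
    ]:
        on_chip, off_chip = snoop_probes(total, per_chip)
        print(f"{label}: {on_chip} on-chip, {off_chip} off-chip")
```

In this model the total probe count is n(n-1) for n cores, so doubling the core count roughly quadruples the probes. The share that stays on chip also rises as more cores sit on each chip, which is why on-chip handling matters more for quad-core than for dual-core parts.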
Why Does Intel Continue to Pursue Ad Hoc Multi-Core Processor Designs?
Only a psychoanalyst would be fully qualified to analyze Intel's behavior and offer opinions on the matter. Since we hold no such certification, we can only comment on the results of Intel's approach. First, we acknowledge that these ad hoc approaches deliver more performance than a single chip (in the same package), even if the actual performance falls far below what a more intelligently designed chip might deliver. Thus Intel customers in need of more performance than the current Intel line-up provides will find the newer chips helpful. Second, Intel gains technology leadership points if it beats AMD to market with a quad-core processor, regardless of the underlying technology or performance of that processor. Insight 64 doubts, however, whether IT professionals will be as easily taken in as consumers in Best Buy by hollow technology claims that won't deliver meaningful performance improvements.
We are confident that eventually Intel will introduce quad-core processors with competitive performance, but we doubt that Clovertown will be the vehicle that meets this target. The next generation Woodcrest and Clovertown processors will clearly allow Intel to narrow the performance gap between it and AMD, but until Intel comes up with an architected quad-core processor with a scalable memory system, we doubt that the company will surpass AMD's two- and four-socket performance metrics.
i. Paul Otellini's IDF Keynote Address, August 28, 2001
ii. Intel's Justin Rattner, Feb. 10, 2006, cited in Xbit Labs News, http://www.xbitlabs.com/news/cpu/display/20060212234634.html
iii. The author wishes to thank Sun Microsystems for highlighting the wonderful irony of citing "intelligent design" (in the high tech context) as a positive attribute.
iv. Intel's Jonathan Douglas, the engineer in charge of the Smithfield project, noted the need for a rapid development program in a presentation he gave at an industry event in August 2005. "With the realization that its single-core processors had hit a wall, Intel engineers plunged headlong into designing the Smithfield dual-core chip in 2004 but faced numerous challenges in getting that chip to market." … "One reason for the aggressive schedule set for Smithfield was the need to respond to AMD's actions, Douglas said, without mentioning his company's competitor by name. 'We needed a competitive response. We were behind,' he said."
v. It helps when the program and data fit entirely in the cache, thus eliminating main memory latency problems. Some benchmarks rely on this feature, but few real-world programs exhibit such behavior. It hurts when both cores work on shared data, a situation that increases the likelihood that the data one core needs will be in the other core's cache, thus forcing more inter-cache transfers over the slow front-side bus.
vi. Intel also incorporated a variety of techniques to increase the power efficiency of the shared L2 cache, but those are beyond the scope of this note.
vii. Intel also sells a version of Yonah it calls Sossaman in dual-socket server configurations, where coherency checking between the L2 caches of each processor must be handled over the FSB, with all the performance drawbacks inherent in Intel's earlier single-core DP and ad hoc dual-core systems.
viii. We'll be able to say more after Intel's Developer Forum in a few weeks.
ix. They will also inherit Yonah's archaic front-side bus. That's the problem with inheritance; you can inherit your rich uncle's lack of hair along with his bank account.
x. This diagram uses a photo of Yonah as a proxy for Woodcrest. In the Woodcrest photo, the caches consume a far greater proportion of the total chip area.
xi. See Insight 64: Fall IDF: Megahertz is Dead, Long Live Dual-Core, 9/15/2004
xii. Insight 64 based the CPU images used in this illustration on an extrapolation from AMD's current dual-core design. AMD's actual quad-core CPU will likely differ in one or more regards from those shown here.