Saturday, March 18, 2006

IDF Marks the End of Many Eras

Nathan Brookwood
March 14, 2006

Last week's Intel Developer Forum, the 18th to be held in the US since the first 1997 event where Gordon Moore delivered the keynote, marked several turning points with regard to Intel's role in the electronics industry:

• It marked the end of the Howard High era. Howard has been the face of Intel's PR department since as far back as we can remember, but last week he announced his retirement from Intel. He will be missed, although his departure hardly merits a press release.

• It marked the end of the NetBurst microarchitecture era. NetBurst, the technology behind Intel's Pentium 4 and Xeon lines for the last five years, burst onto the scene in November 2000. The original 1.5GHz model barely outperformed the 1.0GHz Pentium III it superseded. Follow-on versions at ever higher frequencies tended to deliver more heat than performance. NetBurst, more than any other Intel technology, created the competitive opportunity that allowed AMD to claim performance and performance/watt leadership for the first time in its history. We expect neither a press release nor an obituary noting the passing of NetBurst to issue from Intel's PR department.

• It marked the end of the Pentium® era. Intel launched its first Pentium 13 years ago, after a court ruled it could not trademark simple numbers like "486" and "586." Although Intel substantially changed Pentium's underlying architecture three times, and shrank its transistors from 800nm to 65nm, each new version proudly bore some variation of the Pentium name. Now Intel believes the Pentium brand has become too closely aligned with megahertz for megahertz's sake. Intel retagged its mobile processors with the new "Core™" brand earlier this year, and plans to extend that brand to desktops with the Q3 launch of chips based on its new Core™ microarchitecture. R.I.P. Pentium®, but don't look for a press release.

• It marked the end of Intel's x86 microprocessor monopoly era. Of course, Intel never referred to itself as a monopolist; if pressed, the company might agree it had "a dominant" segment share. Intel always maintained that it remained vulnerable to competitive challenges. The past year has demonstrated it was correct in its characterization of the market; AMD challenged and Intel was vulnerable. Perhaps Intel's legal department will want to issue a press release.

At IDF Intel showed that it's ready to put its massive engine back on the track. Now that AMD has had a chance to sip from the fountain of technology leadership, will Intel's new products restore the status quo ante? Will AMD willingly resume its traditional role as a provider of low cost, low performance processors for personal computers aimed at value-oriented buyers? We suspect not. In the remainder of this note, we outline the reasoning behind our conclusion.

A Brief History of the x86 Processor Market

From 1980, when Intel crushed its competition (Motorola) to win the microprocessor socket in the original IBM PC until very recently, Intel took no prisoners in its relentless campaign to gain market leadership. The company out-marketed, out-designed and out-manufactured all its so-called competitors. Those competitors demonstrated incredible inventiveness in finding new ways to mess up their own businesses. They under-invested, over-promised and under-delivered; even when they (rarely) achieved a performance advantage, they proved unable to maintain that advantage for more than a quarter or two. The results were about as predictable as a football game between the Pittsburgh Steelers and your local high school team.

Over the past decade, one Intel competitor, AMD, slowly began to get its act together. It assembled a team of expert CPU designers and gave them the time they needed to create an extremely competitive product. It partnered with leading suppliers (first Motorola, later IBM) to augment its manufacturing process technology, and put in place the capacity to supply 20 percent of the processors the market would need. It made a few astute technology bets that worked out in its favor, including SOI process technology, on-board memory controllers and multi-core technology as the path to increased system performance. It has executed with nary a hiccup for almost five years. When Intel's NetBurst problems began to manifest themselves in 2003 and 2004, AMD was well positioned to capitalize on Intel's misery and seize the mantle of technology leadership.

Back to the Present

At last week's IDF show, Intel took the wraps off the next-generation design it has dubbed "Core™ Microarchitecture." By any measure, the architects in Intel's Israeli Design Center (IDC) have created a superb design that puts Intel back in contention. In the desktop and notebook segments, Intel will likely leap ahead of AMD, at least by a skosh, with regard to performance and power consumption. In two-way servers, the Q3 launch of Woodcrest should allow Intel to achieve parity with AMD on thermal and power metrics; at that point each company will win some benchmarks and lose some, based on the behavior of specific programs. It will take longer for Intel to find its way out of its four-way server morass, given that it cannot shed its NetBurst legacy in this segment until late in 2007.

While Intel has busied itself retooling its roadmap and realigning its architecture to fit with advancing semiconductor process technology, AMD has been building its reputation, one chip at a time. It has now emerged as a credible alternative to Intel in the markets both companies serve. The market has shifted from a monopoly with one credible supplier and one wannabe to a duopoly with two credible suppliers. This should hardly come as a surprise. Most markets with high entry barriers eventually evolve into oligopolistic configurations; Intel's ability to maintain its dominant position for as long as it has demonstrates just how high the barriers to entry are in the markets it serves. But now that the walls have been breached, we see little likelihood that the market will revert to its former arrangement. Customers (i.e., system OEMs) prefer a choice of suppliers, be it for graphics (ATI and Nvidia), LAN interfaces (Broadcom, Intel and Marvell), wireless adapters (Atheros, Broadcom and Intel), or a variety of other components. These products, unlike DRAM modules or disk drives, cannot be freely substituted for one another, but they provide similar functions. OEMs choose one or another supplier for a variety of reasons that don't always devolve to "highest performance" or "lowest price." We anticipate that each company will at times pull ahead of the other, only to see the other pull ahead at a later date. This strikes us as the way markets are supposed to operate. The New England Patriots did not make it to Super Bowl XL, but they may very well end up as a contender in Super Bowl XLI.

Price War? We Don't Think So
Although both Intel and AMD lower prices periodically to make room in their product line-ups for new, faster processors, one or two Wall Street analysts always seem to interpret these moves as signs of an impending price war. The fab capacity Intel and AMD plan to add this year buttresses that scenario. We beg to differ. In markets with perfect competition (i.e., few barriers to entry or exit, and many competitors) prices tend to track the marginal cost of production. Add in a few barriers to exit, and prices can even fall below marginal cost, as happens from time to time in the DRAM industry. The competition in the x86 processor segment is far from perfect in this regard. The segment has high barriers to entry and exit, and only two suppliers. Duopolies often end up in a Nash Equilibrium, a state first described by John Nash, the Nobel Laureate portrayed in "A Beautiful Mind." Nash observed that firms in duopolistic markets typically set production levels that maximize their own profits. Each firm understands its competitor's strategy and adapts its own accordingly. The end result is that the combined output of both producers is greater than the output of a single monopolistic producer, but less than a large number of producers operating independently would deliver. Conversely, the overall price level in a duopoly is lower than a monopolist would charge, but higher than the prices determined by a perfectly competitive market. Thus the combined profit for duopolists will be less than that of a monopolistic supplier, and some of the profit a monopolist would have obtained gets redistributed to the customers of the duopolists. This may account for the somewhat lopsided distribution of profits in the PC industry to date. Historically, Intel and Microsoft have done very well in that environment, while PC supplier margins have varied between thin and non-existent. 
We view the increasing profitability of HP's PC business as a sign this redistribution may already have begun.
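
The ordering described above (monopoly, then duopoly, then perfect competition) can be checked with a minimal sketch. It assumes the textbook Cournot model, one standard instance of a Nash equilibrium, with a linear demand curve and identical constant marginal costs. The demand intercept, slope and cost figures are hypothetical, chosen only to keep the arithmetic simple; they are not estimates for the x86 market.

```python
from fractions import Fraction


def best_response(rival_output, a, b, c):
    """Profit-maximizing output for one firm, given its rival's output.

    Assumes inverse demand price = a - b * total_output and a constant
    marginal cost c (hypothetical textbook model, not market data).
    """
    return max(Fraction(0), Fraction(a - c, 2 * b) - Fraction(rival_output) / 2)


def cournot_outcome(a, b, c, firms):
    """Total output, price and industry profit with `firms` identical suppliers."""
    total_output = Fraction(firms * (a - c), b * (firms + 1))
    price = a - b * total_output
    return total_output, price, (price - c) * total_output


def competitive_outcome(a, b, c):
    """Perfect competition: price falls to marginal cost, so profit is zero."""
    return Fraction(a - c, b), Fraction(c), Fraction(0)


A, B, C = 100, 1, 20  # hypothetical demand intercept, slope and marginal cost

monopoly = cournot_outcome(A, B, C, firms=1)
duopoly = cournot_outcome(A, B, C, firms=2)
competitive = competitive_outcome(A, B, C)

# At the duopoly equilibrium, each firm's output is its best response
# to the other firm's output, so neither gains by changing it.
per_firm = duopoly[0] / 2
assert best_response(per_firm, A, B, C) == per_firm

for label, (output, price, profit) in [
    ("monopoly", monopoly),
    ("duopoly", duopoly),
    ("perfect competition", competitive),
]:
    print(f"{label:20s} output={float(output):6.2f} "
          f"price={float(price):6.2f} profit={float(profit):8.2f}")
```

With these illustrative numbers, the monopolist sells 40 units at a price of 60, the duopolists together sell about 53.3 units at about 46.7, and a perfectly competitive market sells 80 units at a price of 20. Combined supplier profit shrinks at each step, and the difference flows to customers.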

[1] The PR department will need to designate a replacement who can utter the phrase "OK guys, you know the drill. Raise your hands and wait for the mike runner to come to you," with the same aplomb as Howard.
[2] Intel dubbed the internal effort to win IBM's PC business "Operation Crush."
[3] AMD alleges that other factors may be at work, but that's a subject for a different note.

Fin de Siecle 2006 Insight 64 Nathan Brookwood

Wednesday, March 15, 2006

The Role of Intelligent Design in the Evolution of Multi-Core Processors
Nathan Brookwood
February 28, 2006

Multi-core is the new megahertz. Suppliers whose technology roadmaps a few years ago led inexorably to "10GHz by the end of the decade" now boast of plans to incorporate "ten to hundreds of execution engines" within a decade. Because multi-core designs usually outperform single-core processors based on similar microarchitectures, they require more memory and I/O bandwidth than their single-core counterparts; the old laws of "balanced system design" still apply. Systems that lack the necessary bandwidths will of necessity deliver sub-optimal results. System architecture still matters.

Earlier this month, Intel demonstrated an early version of Clovertown, a quad-core processor it hopes will allow it to "Leap Ahead" of AMD in the server performance battle that began in earnest in 2003. Although the company coyly refuses to indicate what's inside the package, industry rumors suggest Clovertown includes a pair of Woodcrests, Intel's Next Generation Microarchitecture (iNGMa) server processor. Some question whether an ad hoc multi-chip approach can legitimately claim the "quad-core" label, a debate Insight 64 chooses to avoid. On the one hand, it's clear there are four cores residing in each CPU socket. It hardly matters whether these four cores reside on a single (large) piece of silicon or on two smaller dice in a multi-chip package. On the other hand, the two processors behave like two dual-core processors in a two-socket Woodcrest platform, although their performance will be a tad slower due to constraints on the speed of the front-side bus that links all four cores to the rest of the system. In the remainder of this note, Insight 64 compares and contrasts the architecture of Intel's ad hoc multi-core processors that (from our perspective) were thrown together to back up marketing collateral, with Intel's own more advanced designs that display the influence intelligent designers applied in pursuit of superior system performance. For the sake of completeness, we'll also look at AMD's approach to dual- and quad-core technology.

Intel's First Dual-core Designs – Short Time-to-Market but Weak Performance

Intel's initial entries in the dual-core arena – Smithfield, Paxville, Dempsey and Presler – all take a pragmatic, ad hoc approach to dual-core integration. These chips' packages contain two independent processors, sometimes located on a single large die, and other times on two smaller dice. In all these designs, neither core realizes the other lies in close physical proximity. Communication between the cores must be accomplished over the external front side bus that connects both cores to the north bridge. Figure 1 illustrates the most highly developed of these approaches – Dempsey with the Blackford chipset – that provides a separate front-side bus for each CPU socket in the system. This simplistic approach to dual-core architecture greatly shortened Intel's development schedules, and allowed the company to bring its initial dual-core products to market in just nine months.

Figure 1
Intel Dual-Core (Dempsey) with Blackford Chipset

Although straightforward from a design standpoint, Intel's initial dual-core processors suffer from two key deficiencies. First, the front-side bus, already a performance inhibitor in single-core systems, now must be shared by both cores. Second, inter-processor cache snooping, a feature that ensures the coherency of cached data, must be performed over the front-side bus, just as it is in the case of discrete dual-processor systems. Thus cache snooping places an incremental load on the FSB, which in turn constrains the performance of cache snooping and degrades memory access latency. Intel attempts to mitigate the issues stemming from its use of a front-side bus by increasing the cache size of its processors, an approach that sometimes helps and sometimes hinders performance.

Yonah Gets It Right
While Intel's engineers in Santa Clara and Oregon floundered with their ad hoc dual-core approaches, the engineers in Intel's Israeli Design Center (IDC) had the time and the budget to pursue a more refined and effective approach. Yonah, their initial dual-core chip as illustrated in Figure 2, duplicates most elements of the Dothan processor from which it was derived, but shares caches and system interface functions across both cores. Since Yonah's two cores share a common L2 cache, they never need to go off-chip to ensure cache coherency. Better still, when both cores execute threads of a single application, the data either core pulls into the cache can be utilized without additional effort by the other. Yonah eliminates most of the performance-inhibiting front-side bus contention issues common to Smithfield, Paxville, Presler and Dempsey, since the two cores share a front-side bus interface that serves both cores' requirements. This also simplifies bus loading characteristics from a circuit design perspective and facilitates higher speed FSB operation. Although Insight 64 regards front-side bus interfaces as an archaic design concept, Yonah's FSB interface certainly makes the best of a bad situation.

Figure 2
Intel's Dual-Core (Yonah) Mobile Processor

Although it's a bit early to discuss Intel's Next Generation Microarchitecture (iNGMa) processors (Merom and Conroe), it's safe to assume these chips will inherit Yonah's approach to dual-core architecture and will feature even more inter-core sharing than the earlier Yonah processor. Intel designed Woodcrest, its iNGMa processor for two-way servers, so that OEMs can "drop" it into platforms originally designed for Dempsey, its ad hoc dual-core CPU, as shown in Figure 3. Given Blackford's dual independent front-side buses, and Woodcrest's shared bus interface, Intel has room to increase the FSB speed over the corresponding Dempsey-based platforms. This is about as good as it gets in Intel's FSB-based dual- and multi-processor server architectures.

Figure 3
Intel Dual-Core (Woodcrest) Server Processor with Blackford Chipset

Enter Clovertown from Stage Left

Just when we thought Intel had finally kicked its ad hoc, quick and dirty approach to multi-core processor design, the company began to beat the drums for Clovertown, its "next generation" quad-core server processor. Intel hasn't said whether it architected Clovertown as a quad-core processor, or merely combined a pair of dual-core chips in a multi-chip package, a la its Dempsey (et al) dual-core designs. At a similar point in its dual-core disclosures a few years back, Intel refused to comment on the internal design and architecture of those chips, which Insight 64 then surmised (and events later confirmed) took an ad hoc approach. Based on the timing of the recent drumbeats, we believe that Clovertown will consist of two Woodcrest dice crammed into a single package, as illustrated in Figure 4. We've seen this movie before, and it didn't have a happy ending then, either. Quad-core processors demand more memory bandwidth than dual-core designs, but the ad hoc design of Clovertown asks an FSB slower than the one that feeds the dual-core Woodcrest to feed four cores instead of two. The label (and the press release) will say "quad-core," but the performance will likely be only marginally better than a dual-core Woodcrest in most applications. Clovertown will resurrect all the problems (cache snooping over the front-side bus, contention for FSB access, an overworked memory controller) common to Intel's earlier ad hoc dual-core designs. As they meander through Intel's fabs, in-process Woodcrest dice will worry that they may end up starving for bandwidth in a Clovertown package, while their more fortunate brethren will go into simpler dual-core packages and receive a far richer diet of data. Although we doubt there will be an outcry to end this inhumane treatment of processor chips, we're reasonably sure these ad hoc quad-core devices won't be all that they can be.

Figure 4
Intel Quad-Core (Clovertown) Server Processor with Blackford Chipset
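
The bandwidth squeeze described above comes down to simple division, sketched below. The bus speeds and the 64-bit bus width are assumptions picked for illustration; Intel has not disclosed Clovertown's bus speed, and the function name is hypothetical. The sketch computes only the peak bandwidth each core could claim if all cores on one front-side bus demanded data at once.

```python
def peak_bandwidth_per_core(transfers_per_sec, bus_width_bytes, cores_on_bus):
    """Peak bus bandwidth (bytes/s) available to each core when every core
    sharing one front-side bus demands data at the same time."""
    return transfers_per_sec * bus_width_bytes / cores_on_bus


BUS_WIDTH_BYTES = 8  # assumed 64-bit data bus

# Hypothetical bus speeds, in transfers per second, for illustration only.
dual_core = peak_bandwidth_per_core(1_333_000_000, BUS_WIDTH_BYTES, cores_on_bus=2)
quad_core = peak_bandwidth_per_core(1_066_000_000, BUS_WIDTH_BYTES, cores_on_bus=4)

print(f"dual-core package: {dual_core / 1e9:.2f} GB/s per core")
print(f"quad-core package: {quad_core / 1e9:.2f} GB/s per core")
print(f"ratio: {dual_core / quad_core:.2f}x")
```

Under these assumptions each core in the quad-core package gets well under half the peak bandwidth of a core in the dual-core package. The sketch ignores cache snooping traffic on the same bus, which would leave even less for data.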

Opteron Got It Right the First Time
Unlike Intel's multi-year dalliance with hyper-pipelining and ultra high clock frequencies as the path to increased processor performance, AMD's CPU architects identified the potential for multi-core processors as they designed their first generation Opteron and Athlon64 processors. Those initial single-core chips, introduced in 2003, contained features that simplified AMD's move to dual-core in 2005, and allowed both cores to share the on-board memory controller and HyperTransport links, as shown in Figure 5. AMD's system partners accomplished the move from single to dual-core systems by merely "dropping" dual-core Opterons into the sockets previously designated for single-core processors, a far simpler move than the gymnastics Intel's system OEMs had to undertake in their move to dual-core Xeon systems.
Figure 5
AMD Dual-Core Opteron, circa 2005

Insight 64 does not anticipate the release of AMD quad-core processors prior to 2007, but the company has already laid the groundwork for that offering. This year, for the first time since the 2003 Opteron launch, AMD plans to revise the socket design used for its server processors. The new so-called "Socket F" used in the motherboards OEMs roll out this year for "F Step" dual-core Opterons also contains the features AMD needs to support the quad-core Opterons included in its 2007 roadmap, as shown in Figure 6. Just as in the dual-core designs, all the cores in AMD's architected quad-core chip will share access to the memory controller and HT links. Cache snooping operations that involve only the cores on a single chip can be handled on chip without any off-chip traffic. Since cache snooping traffic grows much faster than the number of cores in a system, this aspect of AMD's design will be even more important in its quad-core CPUs than in the current dual-core models.

Figure 6
AMD Quad-Core Opteron, circa 2007
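
A rough count shows why keeping snoops on chip matters more as core counts rise. The sketch below assumes a deliberately simplified broadcast model in which every core misses once and each miss must probe every other core's cache. It ignores probe filtering and real miss rates, and the function names are hypothetical. In this model the total probe count grows roughly with the square of the core count.

```python
def snoop_probes(sockets, cores_per_socket):
    """Count probes when every core misses once and each miss must check
    every other core's cache (simplified broadcast model).

    Returns (on_chip, off_chip): probes answered inside the requester's
    own chip versus probes that must cross to another socket.
    """
    total_cores = sockets * cores_per_socket
    on_chip = total_cores * (cores_per_socket - 1)
    off_chip = total_cores * (total_cores - cores_per_socket)
    return on_chip, off_chip


def snoop_probes_shared_bus(sockets, cores_per_socket):
    """Same model with no on-chip path: every probe uses the external bus."""
    total_cores = sockets * cores_per_socket
    return 0, total_cores * (total_cores - 1)


for cores in (1, 2, 4):
    on_chip, off_chip = snoop_probes(2, cores)
    _, bus_only = snoop_probes_shared_bus(2, cores)
    print(f"2 sockets x {cores} cores: on-chip={on_chip:2d} "
          f"off-chip={off_chip:2d} (all-external design: {bus_only:2d})")
```

In a two-socket system under this model, moving from dual-core to quad-core chips raises the total from 12 probes to 56. A design that resolves same-chip probes internally sends 32 of those 56 off chip, while a design with no on-chip path sends all 56 over the external bus.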

Why Does Intel Continue to Pursue Ad Hoc Multi-Core Processor Designs?

Only a psychoanalyst would be fully qualified to analyze Intel's behavior and offer opinions in this regard. Since we have never been certified as such, we can only comment on the results of Intel's approach. First, we acknowledge that these ad hoc approaches deliver more performance than a single chip (in the same package), even if the actual performance falls far below what a more intelligently designed chip might deliver. Thus Intel customers in need of more performance than the current Intel line-up provides will find the newer processors helpful. Second, Intel gains technology leadership points if it beats AMD to market with a quad-core processor, regardless of the underlying technology or performance of that processor. Insight 64 doubts, however, whether IT professionals will be as easily taken in as consumers in Best Buy by hollow technology claims that won't deliver meaningful performance improvements.

We are confident that eventually Intel will introduce quad-core processors with competitive performance, but we doubt that Clovertown will be the vehicle that meets this target. The next generation Woodcrest and Clovertown processors will clearly allow Intel to narrow the performance gap between it and AMD, but until Intel comes up with an architected quad-core processor with a scalable memory system, we doubt that the company will surpass AMD's two- and four-socket performance metrics.

i. Paul Otellini's IDF Keynote Address, August 28, 2001
ii. Intel's Justin Rattner, Feb. 10, 2006, cited in Xbit Labs News, http://www.xbitlabs.com/news/cpu/display/20060212234634.html
iii. The author wishes to thank Sun Microsystems for highlighting the wonderful irony of citing "intelligent design" (in the high tech context) as a positive attribute.
iv. Intel's Jonathan Douglas, the engineer in charge of the Smithfield project, noted the need for a rapid development program in a presentation he gave at an industry event in August 2005. "With the realization that its single-core processors had hit a wall, Intel engineers plunged headlong into designing the Smithfield dual-core chip in 2004 but faced numerous challenges in getting that chip to market." … "One reason for the aggressive schedule set for Smithfield was the need to respond to AMD's actions, Douglas said, without mentioning his company's competitor by name. "We needed a competitive response. We were behind," he said."
v. It helps when the program and data fit entirely in the cache, thus eliminating main memory latency problems. Some benchmarks rely on this feature, but few real-world programs exhibit such behavior. It hurts when both cores work on shared data, a situation that increases the likelihood that the data one core needs will be in the other core's cache, thus forcing more inter-cache transfers over the slow front-side bus.
vi. Intel also incorporated a variety of techniques to increase the power efficiency of the shared L2 cache, but those are beyond the scope of this note.
vii. Intel also sells a version of Yonah it calls Sossaman in dual-socket server configurations, where coherency checking between the L2 caches of each processor must be handled over the FSB, with all the performance drawbacks inherent in Intel's earlier single-core DP, and ad hoc dual-core systems.
viii. We'll be able to say more after Intel's Developer Forum in a few weeks.
ix. They will also inherit Yonah's archaic front-side bus. That's the problem with inheritance; you can inherit your rich uncle's lack of hair along with his bank account.
x. This diagram uses a photo of Yonah as a proxy for Woodcrest. In the Woodcrest photo, the caches consume a far greater proportion of the total chip area.
xi. See Insight 64: Fall IDF: Megahertz is Dead, Long Live Dual-Core, 9/15/2004
xii. Insight 64 based the CPU images used in this illustration on an extrapolation from AMD's current dual-core design. AMD's actual quad-core CPU will likely differ in one or more regards from those shown here.

Intel Inside, Apple Outside

Nathan Brookwood
January 16, 2006

Steve Jobs used last week's MacWorld conference to launch the first Intel-based Macintosh systems, although none of the new machines bear any external signs ("Intel Inside," "Core™ Duo" or "VIIV") that reveal their silicon underpinnings. Apple gives users their choice of Core Duo processors running at either 1.83GHz or 2.0GHz, mated with 17-inch and 20-inch displays, at prices starting at $1,299 for the smaller screen model and $1,799 for those based on the larger screen. The systems look just like, and are priced just like the G5-based iMacs Apple has been selling, but they offer two to three times the performance of the earlier G5-based models. The company also introduced the awkwardly-named MacBook Pro, a new notebook that comes with either a 1.6GHz or a 1.83GHz Core Duo and a 15.4-inch display. Apple prices the new notebook like the $1,999 15-inch G4-based PowerBook it replaces, and claims the new Intel-based design outperforms the old G4 design by a factor of four to five. Given the barely competitive performance of those aging G4 notebooks, this probably wasn't hard to achieve.

Smoothing the Software Transition
Apple released a fully-ported, native x86 version of its OS X (Tiger) operating system, along with native x86 versions of its key Macintosh applications: iPhoto, iMovie, iDVD, GarageBand, iLife, iWork, iWeb and Aperture. In this regard, its move from PowerPC (PPC) to x86 differs dramatically from the 68K to PowerPC transition it undertook in 1994. When it launched its initial PowerMac offerings, large portions of the OS and most applications ran only in 68K emulation mode. Apple's Xcode software development environment plays a key role in smoothing this transition and making it a bit less painful than the earlier one. The company extended Xcode to support x86 processors, thus allowing Apple's developers, as well as those working for key third-party software development partners, to stay within a familiar environment as they adapted their programs to run on x86 processors. Apple intends to deliver so-called "universal versions" of its software that support both PowerPC and x86 architectures, so the installer can always match the proper binaries with the architecture of the machine on which it is running. It's providing a low-cost ($49) "crossgrade" package that allows users of earlier versions of Apple's proprietary software to get the latest x86 versions. Hopefully, ISVs will follow suit, and provide affordable mechanisms for customers to get x86-native versions of their packages as well.

Of course, some PowerPC Mac software may never get ported to x86. Most PPC applications running on OS X can use an emulation/translation facility Apple calls "Rosetta," based on Transitive Corp.'s QuickTransit technology, to run on the new x86 Macs. Rosetta works well with non-CPU-intensive applications like word processors and web browsers, but bogs down when it encounters compute-intensive tasks like Photoshop or Premiere. Microsoft currently supports its Mac application suite via Rosetta, and plans to deliver native x86 versions this spring. Adobe, Apple's other key ISV, has been less forthcoming regarding its x86 plans; users who make extensive use of Photoshop or Premiere should probably delay MacIntel purchases until this situation clears up. Apple supports Rosetta only under OS X; Mac users who have not yet migrated from Mac OS 8 or Mac OS 9 have no easy way to move their applications to MacIntel configurations, and will be consigned permanently to increasingly obsolete PowerPC platforms. Consider these users as the collateral damage created by Apple's latest architectural transition.
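
The three outcomes for an OS X application (native execution, translation, or no path at all) can be summarized in a minimal sketch. The function name and the architecture labels are hypothetical, chosen for illustration; this is not Apple's actual loader or installer logic.

```python
def plan_launch(binary_archs, host_arch):
    """Decide how an OS X application could run on a given Mac.

    binary_archs: set of architectures the application ships code for,
    e.g. {"ppc", "i386"} for a universal build (labels are illustrative).
    host_arch: architecture of the machine, "ppc" or "i386".
    """
    if host_arch in binary_archs:
        return "native"
    if host_arch == "i386" and "ppc" in binary_archs:
        return "translated"  # PowerPC code run through Rosetta-style translation
    return "unsupported"


print(plan_launch({"ppc", "i386"}, "i386"))  # universal build on an Intel Mac
print(plan_launch({"ppc"}, "i386"))          # PowerPC-only build on an Intel Mac
print(plan_launch({"i386"}, "ppc"))          # x86-only build on a PowerPC Mac
```

A universal build runs natively on either architecture, a PowerPC-only build falls back to translation on an Intel Mac, and an x86-only build has no path on a PowerPC machine.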

Reliance on Intel Chipsets Reduces Apple's Hardware Development Expense
The new iMacs and MacBooks incorporate the processor formerly known as Yonah, Intel's latest dual-core mobile chip. The non-portable iMac benefits from the Core Duo's low heat dissipation, an important issue given the iMac's cramped interior. Both systems rely on the same Intel core logic (i.e., chipsets) used in the latest Windows-based "Napa" platforms Dell, HP, Lenovo (and countless others) released at the start of the month. In addition to reducing Apple's time to market with new technologies, the move from PPC to x86 saves Apple millions in R&D expense by getting the company out of the core logic development business.

No 64-Bit Tiger X86 Support Until Later This Year
Unfortunately, Intel's new Core Duo lacks 64-bit extensions, precluding Mac users from tapping the 64-bit capabilities in the Tiger OS. This should not bother MacBook users, since Apple's G4-based notebooks also lack 64-bit capability, but iMac G5 users, who had just gotten a taste of 64-bittedness, now will find themselves relegated back to the land of 32 bits. As usual, Apple refuses to comment on its future product roadmaps. Insight 64 anticipates that when Intel releases its next generation 64-bit Core Duo processor (Merom) later this year, Apple will likely refresh the current products to add 64-bit support, and extend the line with x86-based versions of the PowerMac tower and Xserve configurations, thus completing the PPC to x86 transition more quickly than most had anticipated when Apple first announced its Intel strategy last spring.

Windows Can't Run On MacIntel Platforms; OS X Can't Run On Non-MacIntel Platforms
Following Apple's disclosure of its x86 strategy, many hoped the new Apple boxes would have the ability to run Windows as well as OS X. Although true Macolytes question the sanity of anyone who would taint Apple's sacred hardware with the Devil's operating system, those who are trapped in a Windows world (including this author) but admire Apple's craftsmanship and industrial design certainly viewed Windows on MacIntel as an attractive alternative. Alas, no such option will be forthcoming. Although Apple now uses standard Intel processors and chipsets, standard ATI Radeon graphics controllers, and standard USB, SATA and PCI-Express buses, the new Macs differ from Windows-based PCs in one key regard: firmware. Since the launch of the very first 8088-based IBM PC in 1981, DOS and Windows-based systems have relied on BIOS (Basic Input/Output System) firmware to boot the system and start the OS. BIOS firmware has evolved greatly in function, but lacks flexibility and (some feel) needlessly complicates PC support issues. For several years, Intel has encouraged its PC OEM customers to adopt a more enlightened firmware environment dubbed the "Extensible Firmware Interface (EFI)," but inertia keeps BIOS in the driver's seat. Apple, with no x86 legacy concerns, opted for the newer, more flexible EFI approach. Tiger uses EFI to start up on x86 machines, and since Wintel boxes lack EFI, they cannot run Tiger. Conversely, Windows needs BIOS to get started, and since MacIntel boxes lack BIOS, they cannot run Windows. These different firmware environments will separate MacOS and Windows environments almost as effectively as instruction set architecture did when Macintosh software ran only on PowerPC chips.

The Impact on Apple's Market Share
When Apple signaled its new strategy, Insight 64 feared that a difficult and awkward software transition to x86 might drive some Apple customers away from the platform, into the willing arms of Wintel suppliers. Given the aplomb with which the migration appears to be progressing, those fears have largely abated at this point. Although Apple's move to x86 may not result in wholesale defections from its installed base, it remains to be seen whether these new platforms can pry Windows users away from XP. Insight 64 remains skeptical in this regard. Regardless of how easy or hard a software interface may be to master, once it has been assimilated, moving to a different environment takes more work than staying with the old one. Given the vast improvements in price/performance enabled by its shift to x86, Apple may be able to attract more new users (i.e., those who have yet to fill a disk drive with files and programs that cannot move easily to an incompatible system) than has been the case over the past few years. But, barring a major disaster with the upcoming Windows Vista launch, we would be greatly surprised if Apple's shift to x86 resulted in a meaningful increase in the company's market share.

Apple Outside
Few ingredient branding programs have ever achieved the success of the "Intel Inside" campaign Intel put into practice in 1991. Over the years, Intel has spent more than $3B to subsidize OEM marketing programs that promote their own brands, along with Intel's. (That little five-note "Intel sound" you hear whenever you see a Dell, HP, IBM or Lenovo ad on TV means Intel has paid roughly half of the cost of the ad.) Given the thin margins on which Wintel PC suppliers operate, these Intel Inside subsidies can greatly impact the OEM's profit picture. Whatever doubts we had about whether Steve Jobs would be tempted by the pots of gold available via the Intel Inside program were erased when we laid eyes on the new iMacs and MacBooks. MDF (Market Development Funds) notwithstanding, Apple has no intention of deploying any brand but its own on its products or in its advertising. They may have switched to Intel inside, but they're still Apple on the outside.

Intel Inside, Apple Outside © 2006, Insight 64

Monday, March 13, 2006

Yen and the Art of Microprocessor Design

(Sun Launches Its Sun Fire T1000/2000)

Nathan Brookwood
December 7, 2005

Yesterday, a few months earlier than it had projected three years ago, Sun launched its first systems based on its UltraSPARC T1 (née Niagara). These systems, the 1U, $2,995 T1000 and the 2U, $7,795 T2000, contain only a single processor chip, but that chip contains eight CPU cores, each with four threads, and looks to the software like 32 discrete (single-core) CPUs. The new Sun systems handily outperform comparably configured, albeit more expensive, Power 5+, Xeon and Itanium configurations on standard benchmarks (SPECweb2005, SPECjAppServer2004, and SPECjbb2005, along with others). But performance is only half the story; the T1000 and T2000 consume far less power and dissipate far less heat than those Power 5+, Xeon and Itanium systems. This gives Sun a dramatic advantage in terms of performance per Watt, a metric of increasing importance to IT managers struggling to handle growing workloads in datacenters that have already maxed out their power and HVAC resources.1 Sun's systems beat competitive offerings by factors of four or five on this key metric, a truly stunning improvement over the prior state of the art. If these new systems cannot reignite growth of Sun's SPARC-based systems business, it's hard for Insight 64 to imagine what could.

Sun's advantages all stem from the architecture of its new UltraSPARC T1 processor. Most processor architects agree that DRAM latency (the time it takes to move data from the system's main memory to its on-chip caches) is usually the principal impediment to improved system performance. It takes anywhere from 50 to 100 nanoseconds for a typical DRAM subsystem to deliver data to the CPU. That interval spans 100 to 200 clock cycles on a 2GHz processor like Intel's Xeon or AMD's Opteron, so a chip that could otherwise sustain about two instructions per clock loses the opportunity to execute 200 to 400 machine instructions. CPU designers employ a variety of techniques to minimize the impact of cache misses. Large on-chip caches reduce the likelihood of a miss somewhat, but consume large amounts of chip real estate and add to cost. Out-of-order execution (OOO) allows a CPU to process instructions that don't depend on missing data while stalled instructions wait for the data they need to trickle in from memory. This approach complicates CPU design and provides at best a partial solution, since even the most advanced CPUs can juggle only about 100 instructions in this manner, while the DRAM delays force the loss of 200 to 400 instruction execution slots. Sun's engineers recognized (correctly, in our opinion) that the mismatch between CPU and DRAM speed would only worsen over time, and that an entirely different approach was needed.
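The arithmetic behind those figures can be sketched in a few lines of Python. This is only a back-of-the-envelope model: the function name is ours, and the rate of two instructions per clock is an assumption chosen because it reproduces the 200 to 400 range quoted above, not a number published by Intel, AMD or Sun.

```python
def lost_instruction_slots(latency_ns, clock_ghz, instr_per_cycle):
    """Instruction slots forfeited while one DRAM access is outstanding."""
    stalled_cycles = latency_ns * clock_ghz  # GHz is cycles per nanosecond
    return stalled_cycles * instr_per_cycle


OOO_WINDOW = 100  # instructions an out-of-order core can juggle (figure from the text)

for latency_ns in (50, 100):
    # Assumed: a 2GHz core that could otherwise sustain two instructions per clock.
    slots = lost_instruction_slots(latency_ns, clock_ghz=2.0, instr_per_cycle=2)
    uncovered = max(0, slots - OOO_WINDOW)
    print(f"{latency_ns} ns miss: {slots:.0f} slots lost, "
          f"{uncovered:.0f} beyond the reach of OOO")
```

Even on these generous assumptions, out-of-order execution covers at most half of the slots lost to a single miss.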

David Yen, the former CPU architect who now heads Sun's SPARC systems business, decided to set out in an entirely new and radical direction.2 Rather than fighting DRAM latency, he designed a chip that accepts latency as a fact of life, and operates well in spite of it. Instead of adding esoteric features to minimize the impact of cache misses, his design merely switches to a ready-to-run thread whenever an executing thread stalls due to a cache miss. The processor works on this second thread until it too stalls, and then switches to a third, and even a fourth thread. As long as at least one of the four threads remains unblocked, the CPU does useful work that directly contributes to the final calculations. This simplifies the overall execution pipeline, and eliminates the need for fancy branch prediction and OOO hardware. The chip operates on the philosophy that "if we stall, we stall; there's always something else to do." The simpler pipeline shrinks each core, which lets Sun fit more cores on the chip (eight) than any other general purpose processor manufactured on a 90nm process. With that many cores at work, Sun can run the chip at a lowly 1.2GHz, which reduces its power requirements. The T1 processors announced today consume 70 Watts, a little more than half the power needed to run the fastest Xeons in Intel's line. (Sun missed its original 60W power target, but few will notice and the product still beats all its competitors in this regard.)
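A toy simulation shows why switching threads on a stall keeps a simple core busy. It is a sketch under stated assumptions, not a model of Sun's actual pipeline: every thread alternates between a fixed burst of work and a fixed-length cache miss, switching threads costs nothing, and the cycle counts below are invented for illustration.

```python
def core_utilization(n_threads, run_cycles, stall_cycles, total_cycles=10_000):
    """Fraction of cycles in which one core finds a ready thread to run.

    Each thread runs for run_cycles, then waits stall_cycles for memory
    (both values are hypothetical, chosen only to illustrate the idea).
    """
    remaining = [run_cycles] * n_threads  # work left before each thread's next miss
    ready_at = [0] * n_threads            # first cycle at which each thread may run
    busy = 0
    current = 0
    for cycle in range(total_cycles):
        # Prefer the thread already running; otherwise take the next ready one.
        for offset in range(n_threads):
            t = (current + offset) % n_threads
            if ready_at[t] <= cycle:
                busy += 1
                remaining[t] -= 1
                if remaining[t] == 0:
                    # Cache miss: park this thread and move on to the next.
                    ready_at[t] = cycle + 1 + stall_cycles
                    remaining[t] = run_cycles
                    current = (t + 1) % n_threads
                else:
                    current = t
                break
        # If no thread was ready, the core idles for this cycle.
    return busy / total_cycles


for n in (1, 2, 4):
    print(n, "thread(s):", core_utilization(n, run_cycles=10, stall_cycles=30))
```

With these made-up numbers, one thread keeps the core busy 25% of the time, two threads 50%, and four threads 100%. Real workloads miss the cache at irregular intervals, so actual utilization would vary.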

Fast processors tend to run slowly unless they are mated with memory subsystems that can provide the bandwidth needed to feed the processor's voracious appetite for data. To satiate the UltraSPARC T1's appetite, Sun includes four DDR2 DRAM controllers directly on the chip. These controllers can move up to 20GB/second between the DRAM banks and the processor. Memory bandwidth should not constrain this processor in most applications. The on-chip location of these controllers minimizes memory access latency. (Even though the T1 embraces DRAM latency, less is always better in this regard.) The memory controllers connect to the unified 12-way associative on-chip level two cache, which in turn is connected to a crossbar switch that moves data between all eight cores and the cache. All of these pieces fit together as shown below:

[Figure: UltraSPARC T1 block diagram. Source: Sun]

The diagram above highlights one of the new chip's key limitations, namely its lackluster floating point performance. All eight cores share a single floating point execution unit. When any thread in any core encounters a floating point instruction, it schedules that instruction's execution on the FPU, and stalls until the operation completes. The wrong mix of integer and floating point instructions could slow the processor to a crawl. Sun argues that such sequences occur rarely in the applications it targeted for the T1, but Insight 64 anticipates that it won't take IBM and Intel long to find just the right mix of integer and floating point code to trigger this pathological behavior.
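A simple bottleneck calculation suggests how quickly a shared FPU could become the limiting factor. The model is ours and deliberately crude: it assumes each of the eight cores issues at most one instruction per cycle, and the number of cycles the FPU spends on each operation is a hypothetical parameter, not a figure from Sun.

```python
def chip_throughput(fp_fraction, cores=8, fpu_cycles_per_op=1.0):
    """Upper bound on instructions per cycle for the whole chip.

    fp_fraction: share of all instructions that need the single shared FPU.
    fpu_cycles_per_op: cycles the FPU is occupied per operation (hypothetical).
    """
    peak = float(cores)  # assumed: one instruction per core per cycle
    if fp_fraction == 0:
        return peak
    # The lone FPU finishes 1/fpu_cycles_per_op operations per cycle, so the
    # whole instruction stream cannot flow faster than this.
    fpu_limit = 1.0 / (fp_fraction * fpu_cycles_per_op)
    return min(peak, fpu_limit)


for frac in (0.0, 0.01, 0.05, 0.25):
    bound = chip_throughput(frac, fpu_cycles_per_op=10.0)
    print(f"{frac:.0%} floating point: at most {bound:.2f} instructions/cycle")
```

Under these assumptions the integer-heavy workloads Sun targeted are unaffected, while a mix with a substantial share of floating point work is throttled by the one FPU no matter how many threads are ready to run.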

The diagram also highlights a second limitation: this 8-core/32-thread processor cannot be used in dual- or multi-processor arrangements. Users seeking more than 32 hardware threads in their systems will have to wait for Niagara 2, the 65nm T1 follow-on slated for 2007. That chip will feature 8 cores and 8 threads per core (64 threads in all), and a dual-processor capability that will allow up to 128 cache-coherent threads in a system.

We're (obviously) impressed with the approach Sun took with its T1-based systems, and with the performance and performance per watt results it published yesterday. Before Sun embarked on its throughput computing adventure, we (like many others) questioned whether it made sense for Sun to develop its own processors. Its UltraSPARC line offered few advantages with regard to performance or price/performance. The chips' primary virtue was the ability to run Solaris without the need to recompile and/or reacquire software applications (a painful process many Macintosh users will be forced to undergo when Apple moves from PowerPC to x86 processors next year). The launch of Sun's new products demonstrates that there is still a place for innovation in the systems business, and that customers are best served when many suppliers are allowed to bring their unique perspectives to the market. It's unlikely that an Intel or an AMD would have pursued the kind of extreme multi-core/multi-thread approach Sun pursued, since those suppliers try to leverage their designs across both client and server markets; the fit between fat clients and 32-thread processors remains an unknown at this point.

Now that Sun has brought its new products to market, it remains to be seen whether prospective customers will adopt them. We're confident that Sun's installed base will find them irresistible, but will the company be able to sway customers that have grown used to purchasing x86-based systems from a multitude of system suppliers? Although Insight 64 generally views the rise of so-called industry-standard systems as an ineluctable force, we also accept that systems based on proprietary approaches can have merit, especially when there is no industry-standard alternative. We believe the advantages offered by Sun's new approach are sufficiently compelling that buyers should set aside their understandable bias toward systems based on industry standards and carefully evaluate these new Sun systems.

1 Paul Otellini recently indicated that Intel believed performance per watt would become a key system purchasing criterion, and asserted that Intel intended to lead the industry with regard to this metric.
2 Although this report cites Yen as the chip's designer, in practice a large team of Sun engineers, including a cadre added via its Afara Websystems acquisition, slaved over this design for more than two years. Yen, however, was smart enough to recognize the idea's merit, and brave enough to support it organizationally although it flew in the face of the conventional wisdom regarding CPU design.

Yen and the Art of Microprocessor Design © 2005, Insight 64