System (Un)Reliability: The Microsoft Xbox 360 Case Study
In yesterday morning’s weekly EDN editorial teleconference, we were discussing (among other things) Mike Santarini’s upcoming cover story on IC reliability in the March 6 issue. The topic of the Xbox 360 and its various reliability woes came up, in part (I suspect) because Mike’s a contented owner of this particular console as of Christmas and may be wondering if (and if so, when) he’s going to need to tap into the system’s recently extended three-year warranty.
Coincidentally, several particularly interesting Xbox 360 reliability-related bits recently found their way into my RSS reader, which I’ll pass along for your Friday-and-weekend inspection.
- The Seattle Post-Intelligencer recently snagged an interview with an anonymous Microsoft ‘insider’ (more commentary from DailyTech), who passed along some detailed (albeit, of course, officially unsubstantiated) information on the console’s RROD (red ring of death) debacles. As I’ve also suggested several times in the past, the insider reveals that Microsoft made a conscious decision to rush the Xbox 360 to market, which resulted in an all-important one-year-plus availability lead in this particular round of the console wars. However, by (for example) relying on 90 nm-fabricated ICs versus waiting for (potentially) cooler-running 65 nm chips, Microsoft is now dealing with customers’ system failures. I should point out that none of my Xbox 360s has (yet) had problems, even though I acquired one of my systems shortly after the November 2005 Zero Hour launch event. Granted, I’m not much of a gamer, but I’ve played plenty of CPU-intensive HD DVD titles on the system…
- …however, the CPU isn’t the system’s primary Achilles’ heel, as my system teardown from last year pointed out, and as the always-excellent Andrew ‘bunnie’ Huang also suggests in his recently published blog entry. The ‘weakest link’ is the GPU, specifically the solder joints that connect it to the PCB and the progressive degradation of those joints as high internal temperatures cause the PCB to flex. The culprit here is coefficient of thermal expansion mismatch between the IC package and PCB, a concept that I showcased in the last product introduction I managed prior to my departure from Intel (the late-1996 µBGA flash memory package). The photos accompanying Huang’s writeup graphically communicate the result of this mismatch, and also show evidence of trapped-gas ‘voiding’ that occurred during initial soldering as part of the system manufacturing flow. In Huang’s particular case the red ink-based ‘dye and pry’ analysis didn’t reveal any flat-out failed solder joints, but the degraded links he discovered certainly could combine with other system compromises to increase the likelihood of breakdown…a likelihood that of course increases over usage time.
- In my particular case, I’m not sure what percentage of the overall HD DVD decoding-and-rendering flow runs in software on the CPU, versus being hardware-accelerated on the GPU, so it’s not clear how much my systems’ GPU solder joints are being stressed. And pragmatically, anecdotal evidence suggests to me that only a small percentage of the Xbox 360 system failures are ‘infant mortality’ (a term that causes me to cringe every time I type it) in nature. By the time most consumers’ consoles exhibit an RROD, the owners have already amassed a library of game titles, effectively ‘hooking’ them. Shipping in a system for replacement with someone else’s refurb is a frustrating hassle, granted, and the reputation damage may hurt Microsoft in the next round of the console wars, but for now Microsoft can rest assured that the lucrative revenue flow from its installed console customer base will in most cases continue unabated.
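For readers who want a feel for the coefficient-of-thermal-expansion mismatch mentioned above, here is a rough back-of-the-envelope sketch in Python. It only applies the textbook linear-expansion relation (change in length = length × CTE × temperature swing) to two materials over the same span. The function name and every number in it are illustrative assumptions of mine, not measurements from the Xbox 360 or figures from Huang's analysis.

```python
def differential_expansion_um(span_mm, cte_a_ppm, cte_b_ppm, delta_t_c):
    """Return the difference in linear expansion, in micrometres, between
    two materials that share the same span and the same temperature swing.

    span_mm    -- distance over which the two materials are joined (mm)
    cte_a_ppm  -- CTE of the first material (ppm per degree C)
    cte_b_ppm  -- CTE of the second material (ppm per degree C)
    delta_t_c  -- temperature swing (degrees C)
    """
    delta_cte_per_c = abs(cte_a_ppm - cte_b_ppm) * 1e-6
    span_um = span_mm * 1000.0
    return span_um * delta_cte_per_c * delta_t_c


# Placeholder inputs, chosen only to make the arithmetic easy to follow.
# They are assumptions, not data for any real package or board.
PCB_CTE_PPM = 16.0      # hypothetical board laminate
PACKAGE_CTE_PPM = 6.0   # hypothetical IC package
SPAN_MM = 15.0          # hypothetical distance from package centre to corner
DELTA_T_C = 50.0        # hypothetical power-on/power-off temperature swing

mismatch_um = differential_expansion_um(
    SPAN_MM, PCB_CTE_PPM, PACKAGE_CTE_PPM, DELTA_T_C
)
print(f"Differential expansion: {mismatch_um:.1f} µm per thermal cycle")
```

With these placeholder inputs the mismatch works out to 7.5 µm, and the solder joints have to absorb that displacement every time the system heats up and cools down. The sketch ignores board flex, joint geometry and solder creep, so treat it as an intuition aid rather than a reliability model.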