Random problems associated with small geometries
Many alarms were being rung, when feature sizes were around 90 nm, about how smaller geometries were going to create a new set of issues. Today we hear more and more announcements of successful 20-nm designs, but I wonder if there is trouble around the corner, or if clever design can avoid the problems. I am talking about random particles, either in the production process or in actual usage of the devices. I have recently read a couple of articles dealing with these issues. The first is titled “Critical Area Analysis and Memory Redundancy”, written by Simon Favre of Mentor Graphics. He introduces the problem in this way:
Critical Area Analysis (CAA) is a DFM technique that measures the susceptibility of a specific layout to random defects and indicates areas of the layout where design modifications can have the greatest positive impact on overall yield. One way to improve yield of an SOC design with embedded memory is to increase the layout spacing in some areas to achieve a better CAA score. Another way is to build redundancy into the memory design so faulty cells can be bypassed during final production test. Of course, redundancy also has a cost in terms of real estate. So deciding how much to employ DFM techniques versus adding more redundant cells is an engineering optimization problem.
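To make the trade-off in that excerpt a little more concrete, here is a minimal sketch using the textbook Poisson yield model. It is not taken from Favre's paper: the function names, the idea that each spare repairs exactly one defect, and the choice to ignore defects landing in the spares themselves are all simplifying assumptions made for illustration.

```python
import math


def poisson_yield(critical_area_cm2, defect_density_per_cm2):
    """Probability of zero killer defects under a simple Poisson model.

    Assumption: defects are independent and uniformly distributed, so the
    expected defect count is critical area times defect density.
    """
    expected_defects = critical_area_cm2 * defect_density_per_cm2
    return math.exp(-expected_defects)


def yield_with_repair(critical_area_cm2, defect_density_per_cm2, spares):
    """Probability of at most `spares` defects (Poisson cumulative sum).

    Assumption: every spare can repair exactly one defect, and the spares
    themselves are never defective. Real redundancy schemes are less tidy.
    """
    expected_defects = critical_area_cm2 * defect_density_per_cm2
    return sum(
        math.exp(-expected_defects) * expected_defects ** k / math.factorial(k)
        for k in range(spares + 1)
    )


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to compare the two options.
    area, density = 0.5, 1.0
    print("baseline        :", poisson_yield(area, density))
    print("10% less area   :", poisson_yield(area * 0.9, density))
    print("two spare cells :", yield_with_repair(area, density, spares=2))
```

In this toy model, wider spacing lowers the critical area while spares raise the number of defects that can be tolerated; the area cost of the spares would have to be added before the two options could be compared fairly.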
The second paper is titled “Single event effects (SEEs) in FPGAs, ASICs, and processors, part I: impact and analysis” written by Dagan White of Xilinx. He introduces the problem this way:
All sub-micron integrated electronics devices are susceptible to Single-event effects (SEEs) to some degree. The effects can range from transients causing logical errors, to upsets changing data, to destructive single-event latch-up (SEL). Traditionally, FPGAs were targeted as being more sensitive due to their use of SRAM for the configuration storage. As dimensions shrink to below 90 nm, SEEs in all devices, including ASICs, FPGAs, and application-specific standard products (ASSPs) must be considered.
SEEs result from interaction of high-energy particles with circuit elements in integrated circuits. When a high-energy particle passes through the silicon substrate of a device, charged particles are created as the result of sub-atomic particle collisions. These particles are generated by an ionization trail along the path of the incoming particle.
If a charged particle impacts at or near a transistor junction, the collected charge can induce an upset to the state of that transistor. If the collected charge is larger than the critical charge of the element, the element changes state. This change in state (or bit flip, in the case of a memory cell) is referred to as an SEU. Similarly, the charged particles can induce a current and voltage spike on a metal interconnect, which is referred to as a single-event transient (SET). If the pulse width of the spike is wide enough, the spike can propagate through the circuit.
It is clear that additional redundancy is ultimately going to be necessary in all electronic devices in order to avoid random errors, either during production or while the device is in use. While most attention is being applied to safety-critical applications and environments (such as aerospace), it would be very disconcerting to have the words in this article randomly change as I write them. Yes - I can provide some of that redundancy in terms of proofreading, but what if my applications, operating system or other software were to randomly change? A failure of the operating system would be a critical failure to me, and I think just about all of the software ever written assumes that the processor does not make mistakes. If that fundamental premise changes, then the migration to multicore may be the smaller of the worries for the software industry. At what point does the redundancy needed to hold the probability of random failure constant offset the additional transistors that can be placed on a chip?
przemek klosowski commented:
Single event upsets were always with us---it's just that the computer usage patterns historically masked it. We used to reboot (or crash) computers often enough so that the errors were papered over.
I always marvelled at the skill of the computer industry in conditioning its customers to just shrug off regular crashes. If cars broke down with the same frequency as computers crashed, there'd be throngs with pitchforks in Detroit.
Linux ran into this early on---look at the number of hardware workarounds and blacklists in the kernel, designed specifically to compensate for brokenness in various hardware components. This brokenness often was not recognized or accounted for in Windows drivers because it was mitigated sufficiently by frequent resets. Seriously, Linux had a significant role in making hardware better because it exposed flaws in the ways that Windows didn't, and shamed many manufacturers into shipping a better product.
In the late 90s I designed a data acquisition setup that collected data for days or weeks into large memory arrays, and we had to use explicit software-based row-column checksums because we kept getting single bit upsets.
Nowadays of course even Windows stays up for a long time, across hibernations and sleeps; hardware is better, but memory is both bigger and more vulnerable due to smaller feature geometry.
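The row-column checksums mentioned in the comment above can be sketched roughly as follows. This is only an illustration of the general idea, not a reconstruction of the commenter's system: the XOR-based checksum, the function names and the list-of-lists data layout are all assumptions.

```python
from functools import reduce
from operator import xor


def checksums(grid):
    """Return (row_sums, col_sums), each the XOR of the words in that line."""
    row_sums = [reduce(xor, row, 0) for row in grid]
    col_sums = [reduce(xor, col, 0) for col in zip(*grid)]
    return row_sums, col_sums


def find_and_fix(grid, stored_rows, stored_cols):
    """Locate and repair an upset confined to one word of `grid`.

    Returns None if the data matches the stored checksums, or the (row, col)
    of the repaired word. Raises ValueError when the damage cannot be pinned
    down to a single word (several upsets, or a corrupted checksum).
    """
    row_sums, col_sums = checksums(grid)
    bad_rows = [i for i, (a, b) in enumerate(zip(row_sums, stored_rows)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(col_sums, stored_cols)) if a != b]

    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        r, c = bad_rows[0], bad_cols[0]
        # The differing bits must agree between the row and column checksum.
        mask = row_sums[r] ^ stored_rows[r]
        if mask == col_sums[c] ^ stored_cols[c]:
            grid[r][c] ^= mask  # flip the upset bits back
            return (r, c)
    raise ValueError("cannot correct: more than one word or a checksum is damaged")


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    rows, cols = checksums(data)   # computed when the data is written
    data[1][2] ^= 0b100            # simulate a single bit upset
    print(find_and_fix(data, rows, cols), data[1][2])
```

A single flipped bit disturbs exactly one row checksum and one column checksum, and their intersection identifies the damaged word. The cost is one extra word per row and per column, which is small for large arrays.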