How to use ECC to protect embedded memories
Increasing memory density, growing system-on-chip (SoC) memory content, higher performance, and technology scaling, combined with reduced voltages, increase the probability of multi-bit transient errors. Notably, transient errors are no longer restricted to aerospace applications. Applications such as biomedical, automotive, networking, and high-end computing are now susceptible to transient errors and need high reliability and safety.
Transient error sources are, in many cases, self-inflicted, because alpha particles are commonly generated in materials adjacent to the chip, in solders, and in the packaging. Because of the higher susceptibility to multi-bit transient errors and the increasing requirement for high reliability, there is a greater need to mitigate transient errors in embedded memories. In this article we discuss transient error detection and correction methods that use advanced error correction code (ECC) based solutions for embedded memories to meet the requirements of today’s high-reliability applications.
Transient errors, also called soft errors, are functional errors resulting from strikes by energetic particles such as neutrons and alpha particles. They are random in nature and typically lead to data corruption or cause electronic systems to crash. In less critical applications, transient errors are eclipsed by more common issues and can be fixed by resetting or rewriting the device. The time required to reset or rewrite the device and bring it back to normal operation is generally acceptable to users.
However, for critical applications such as networking, transient errors can be catastrophic. Relying on a reset strategy alone for transient error mitigation can be very expensive, because the system is unavailable for the duration of the reset or cycle time. This delay might not be acceptable, given that some of these mission-critical systems require 99.999% availability.
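To put the 99.999% figure in perspective, the short calculation below converts an availability target into a yearly downtime budget. It is only a back-of-the-envelope sketch: the function names and the 30-second reset duration are illustrative assumptions, not figures taken from any particular system.

```python
# Minutes in an average year (365.25 days), used as the availability baseline.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability: float) -> float:
    """Return the downtime per year (in minutes) allowed by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def resets_allowed_per_year(availability: float, reset_minutes: float) -> float:
    """Return how many resets of a given length fit into the yearly downtime budget."""
    return downtime_budget_minutes(availability) / reset_minutes


if __name__ == "__main__":
    budget = downtime_budget_minutes(0.99999)
    # Assumed reset duration of 30 seconds (0.5 minutes), chosen for illustration only.
    resets = resets_allowed_per_year(0.99999, 0.5)
    print(f"Downtime budget: {budget:.2f} minutes per year")
    print(f"30-second resets that fit in the budget: {resets:.1f}")
```

Under these assumptions, five-nines availability leaves a little over five minutes of downtime per year, so a reset-only recovery strategy can consume the entire budget after roughly ten soft-error events.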
In addition to disrupting availability, transient memory errors can cause security vulnerabilities. Because transient errors have been causing electronic systems to fail for years, JEDEC JESD89A was defined to standardize the requirements and procedures for soft-error-rate testing of integrated circuits and for reporting the results. However, once a design is complete, the options for taking corrective action based on such testing are limited.
Transient errors, like many other design issues, are very costly to address as an afterthought. They can be handled far less expensively and more effectively when they are addressed proactively during the design phase.
As shown in Figure 1, embedded memories dominate the SoC area, making transient error mitigation for SRAMs crucial for high reliability. Once a transient error flips the bit stored in a storage cell, there is no mechanism to recover the bit other than to explicitly rewrite the value or to correct the error when the data is read out. The most effective way to address transient errors in SRAMs is to use traditional ECC, such as Hamming codes. ECC allows data that is being accessed to be checked for errors and corrected on the fly. It differs from basic parity checking, in which errors are only detected and not corrected.
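The difference between parity and a Hamming code can be illustrated with a minimal software sketch. The example below uses the textbook Hamming(7,4) code, which protects 4 data bits with 3 check bits. It is an illustration only: the function names are hypothetical, and embedded memories typically implement wider codes in hardware, for example codes that add a handful of check bits to each 32-bit or 64-bit word.

```python
def even_parity(bits):
    """Return the single parity bit for a list of 0/1 values."""
    parity = 0
    for bit in bits:
        parity ^= bit
    return parity


def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword.

    Codeword layout by 1-based position: p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit and return (data_bits, syndrome).

    The syndrome is the 1-based position of the flipped bit, or 0 if
    no error was detected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = [1, 0, 1, 1]

    # Parity: a single flip is detected, but its position is unknown.
    stored_parity = even_parity(word)
    corrupted = [1, 0, 0, 1]
    print("parity mismatch:", even_parity(corrupted) != stored_parity)

    # Hamming(7,4): the same kind of flip is located and corrected.
    codeword = hamming74_encode(word)
    codeword[4] ^= 1  # simulate a soft error at position 5
    recovered, position = hamming74_decode(codeword)
    print("recovered:", recovered, "error position:", position)
```

With parity, the reader learns only that the stored word is bad. With the Hamming code, the syndrome identifies which bit flipped, so the correct value can be returned on the same read. Note that this basic code handles only a single flipped bit per codeword; two flips produce a nonzero syndrome that points at the wrong bit.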