EDN Access

 

November 6, 1997


In picking ECCs, the key is bit-error location--not rate

Tom Waschura, SyntheSys Research

Bit-error-rate testing is a well-established measure of digital-data-transmission quality. Even so, the location, rather than the number, of errors in the data stream provides the information you need to choose among the many ECC strategies.

Error-correction-coding theory describes ways of adding overhead to transmitted or stored messages so that the messages are understandable upon reception or playback. Challenging implementations of error-correction codes (ECCs) underlie many advances in computer technology and digital communications--from high-density disk drives, to cellular-telephone and pager networks, to deep-space exploration.

Digital-communications designers' increased use of error-correction techniques has created a demand for tools to help evaluate and exercise coding approaches. Communications-system designers use such tools to try new design approaches before implementing them in expensive ASICs or firmware. By studying the exact locations of bit errors as they occur in the underlying digital channel, a designer can evaluate coding performance before adding error-correction layers. Using real-time analyzers is no more difficult than using familiar bit-error-rate (BER) testers. The results allow rapid creation of efficient systems.

Nearly all systems that transmit or store digital data use some form of error control. System designers must face the certainty that real-world phenomena alter some stored or transmitted bits. The choices of how to handle errors range from doing nothing to using elaborate error-detection and -correction methods. Selecting among the choices depends on the information's accuracy, speed, and latency requirements and whether simultaneous bidirectional communication exists between the sender and the receiver.

An application's intended use of data determines how many errors in the data an application can tolerate. Data files stored in computer systems can usually tolerate one error in 10^13 bits--a BER of 1×10^-13--equivalent to one error in 1250 Gbytes. Most applications that send instrumentation-sensor data across a link can tolerate BERs of 1×10^-9. A data stream whose BER is 1×10^-7 contains fewer errors than would be troublesome in a typical broadcast digital-video playback. In PCM digitized-voice communications, a BER of 1×10^-4 can provide acceptable performance. In some cases, random errors are more tolerable than burst errors, so that systems with higher BERs appear to exhibit higher quality.
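The arithmetic behind these figures is easy to check. The following is a minimal sketch; the function names and the 1-Mbps link rate in the usage lines are illustrative assumptions, and a Gbyte is taken as 10^9 bytes of 8 bits.

```python
def bits_per_error(ber: float) -> float:
    """Average number of bits transferred per bit error."""
    return 1.0 / ber


def gbytes_per_error(ber: float) -> float:
    """Average data volume per error, in Gbytes (10^9 bytes, 8 bits each)."""
    return bits_per_error(ber) / 8 / 1e9


def seconds_per_error(ber: float, bit_rate_bps: float) -> float:
    """Average time between errors on a link running at bit_rate_bps."""
    return bits_per_error(ber) / bit_rate_bps


# A BER of 1e-13 works out to roughly one error per 1250 Gbytes.
print(round(gbytes_per_error(1e-13)))
# Hypothetical 1-Mbps sensor link at a BER of 1e-9: seconds between errors.
print(round(seconds_per_error(1e-9, 1e6)))
```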

The application also determines data-transfer speed and latency requirements. Processing small transactions, such as database accesses or wireless-network communications, requires low latency between the time when data arrives and when it is available for use. If error correction requires postprocessing, the data is unavailable until the postprocessing is complete. Applications that record data onto tape can involve a high degree of latency. A large latency allows spreading ECC words through more data and generally improves a system's ability to correct burst errors. This improvement comes at the expense of significant buffering and delay. If error bursts are a problem, one solution is to use a random-error-correcting coder/decoder with a suitable interleaver and deinterleaver (Figure 1).

The encoder's output sequence is interleaved before transmission and deinterleaved before decoding, thus more uniformly distributing the errors at the decoder input. The interleaver deterministically rearranges or permutes the order of a sequence of symbols. The deinterleaver reverses the process to restore the original order (Figure 2). The two main types of interleavers are periodic and pseudorandom. Periodic interleavers are simpler. They produce an interleaving permutation that repeats at regular intervals.
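One common periodic arrangement is a block interleaver. The sketch below assumes that form: it writes `rows` code words of `cols` symbols each into an array row by row and reads the array out column by column. The function names and the 4×6 dimensions are illustrative assumptions, not details of any particular product.

```python
def block_interleave(symbols, rows, cols):
    """Write row by row into a rows x cols array, read column by column."""
    if len(symbols) != rows * cols:
        raise ValueError("input must fill the interleave buffer exactly")
    return [symbols[r * cols + c] for c in range(cols) for r in range(rows)]


def block_deinterleave(symbols, rows, cols):
    """Reverse block_interleave and restore the original symbol order."""
    if len(symbols) != rows * cols:
        raise ValueError("input must fill the interleave buffer exactly")
    return [symbols[c * rows + r] for r in range(rows) for c in range(cols)]


codewords = list(range(24))                  # 4 code words of 6 symbols each
sent = block_interleave(codewords, rows=4, cols=6)
sent[5:9] = ["X"] * 4                        # a 4-symbol burst on the channel
received = block_deinterleave(sent, rows=4, cols=6)
errors_per_codeword = [received[r * 6:(r + 1) * 6].count("X") for r in range(4)]
print(errors_per_codeword)                   # [1, 1, 1, 1]
```

In this arrangement, a channel burst no longer than the number of rows lands as at most one error in each code word.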

The pseudorandom interleaver is more complex but is a better choice when the channel's burst characteristics can vary substantially. Spread-spectrum antijamming systems exemplify applications of this type. This interleaver takes a block of L symbols after encoding and pseudorandomly reorders them. One implementation sequentially writes the L symbols into RAM and reads them out in pseudorandom order. You can store the permutation in ROM and use the ROM information to address the interleaver memory.
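The RAM-and-ROM implementation can be sketched in software as follows. This is a sketch under assumptions: a Python list stands in for the RAM, a precomputed permutation table stands in for the ROM, and the seeded generator used to build the table is only one hypothetical way to obtain a fixed permutation.

```python
import random


def make_read_order(length, seed):
    """Build the fixed read-address table (the 'ROM' contents)."""
    rng = random.Random(seed)          # seeded, so sender and receiver agree
    table = list(range(length))
    rng.shuffle(table)
    return table


def random_interleave(block, table):
    """Symbols are written sequentially; read them out in table order."""
    return [block[address] for address in table]


def random_deinterleave(block, table):
    """Put each received symbol back at the address it was read from."""
    restored = [None] * len(block)
    for position, address in enumerate(table):
        restored[address] = block[position]
    return restored


table = make_read_order(16, seed=1997)
block = list("ABCDEFGHIJKLMNOP")
assert random_deinterleave(random_interleave(block, table), table) == block
```

Both ends of the link need the same table, which is why a stored permutation suits this scheme.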

The data rate of a communication channel can restrict the practical error-correction approach. Streaming applications must keep up with real-time data. Depending on the error-distribution statistics and the level of hardware support for error correction, certain error-correction algorithms operate too slowly to keep up with the data rate. Digital channels implemented in firmware have much less error-correction horsepower than hardware modems or controllers with built-in, highly integrated error-correction chips.

In some cases, bidirectional communication between a sender and receiver can eliminate the need for forward-error-correction systems. Bidirectional communication allows for creating systems that achieve low BERs by using powerful and simple error detection that asks the transmitter to resend incorrectly received data. This concept is the basis of many network protocols. This same practice can also work in magnetic recording systems that verify the data they write while they are writing it. Errors detected during writing can trigger the replacement of erroneous data blocks in anticipation of trouble during the eventual read back.
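A detect-and-resend scheme of this kind can be sketched in a few lines. The framing, the retry limit, and the `flaky_channel` test channel below are hypothetical; the only library call is Python's standard `zlib.crc32`, used here as the error detector.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 32-bit CRC to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Return (ok, payload); ok is False if the CRC does not match."""
    payload, received_crc = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received_crc, payload


def send_with_retry(payload: bytes, channel, max_tries: int = 3):
    """Resend until the receiver's check passes; return (payload, tries)."""
    for attempt in range(1, max_tries + 1):
        ok, data = check_frame(channel(make_frame(payload)))
        if ok:
            return data, attempt
    raise RuntimeError("retry limit reached")


def flaky_channel(bad_frames: int):
    """Hypothetical channel that flips one bit in the first bad_frames frames."""
    state = {"sent": 0}

    def channel(frame: bytes) -> bytes:
        state["sent"] += 1
        if state["sent"] <= bad_frames:
            damaged = bytearray(frame)
            damaged[0] ^= 0x01
            return bytes(damaged)
        return frame

    return channel


print(send_with_retry(b"sensor reading", flaky_channel(1)))
```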

The first step in designing an error-handling approach is to identify the requirements for the final channel's error statistics. You should specify not only the BER, but also the error-burst profiles. Vendors have delivered systems that meet the BER specification but fail to operate as intended, because they inadequately handle isolated error bursts.

The next step is to characterize the raw link's expected underlying error characteristics. You should completely understand how the system behaves when you subject it to the anticipated causes of failure. In a disk drive, look at the channel when the head is not centered directly over the track. In a satellite-communication application, look at error statistics obtained during rainstorms, snowstorms, and periods of high sunspot activity. In designing error-correction systems, characterizing the channel's raw-error performance is much more demanding than making a simple BER measurement. Burst statistics and the relationship between bit and burst errors and eventual data-blocking factors are important.

Beyond the basic BER, useful digital-channel-error analysis includes a comprehensive study of the location of the link's bit errors. You can analyze bit-error locations using histograms of burst length and intervals between errors, error autocorrelation, and block-oriented error distributions.

Errors in a digital channel often result from several phenomena. For example, it is common to have random-noise-induced errors along with burst errors, such as errors that result from lightning or media defects. Combating the different error components can require different approaches, so it is important to isolate types of errors. Once you gather statistics and define the system requirements, you must match error distributions and processing requirements to error-correction approaches.

Many practical error-correction architectures exist. These architectures include maximum likelihood detectors (for example, the Viterbi decoder), fire codes, Reed-Solomon block codes, Hamming codes, and CRC error detection with retransmission. Along with block-error correctors, it is common to include data shuffling, or interleaving, to scramble burst-error distributions among many ECC words.

Maximum-likelihood detectors usually operate early in bit detection to help make bit- or symbol-threshold decisions. These decisions are based on knowledge of the system characteristics and the bit patterns that are most likely to appear. In a magnetic recorder, for example, two neighboring bit transitions of a disk-drive, partial-response, maximum-likelihood waveform cannot both be north poles. (Poles must alternate from north to south.) A detector that sees an S-N-N-N-S sequence correctly decodes it as an S-N-S-N-S sequence.

Maximum-likelihood detectors mainly reduce small, random errors; therefore, designers often use them with other methods to achieve the desired error performance. Maximum-likelihood detectors typically use RAM-based look-up tables. This method finds the most likely bit sequence by assigning quality metrics to legal bit-sequence transitions and selecting the highest quality bit sequence.

Fire codes are ideal for rare occasions of small error bursts. A fire code is part of an error-correction strategy that appends a small number of bits (for example, 16 or 32 bits) to a block of data during encoding. This approach enables the receiver to correct a small error burst in each data block. Among fire-code benefits is the ability to use the same checksum both to detect errors and to correct a limited number of errors. A typical 32-bit checksum used in telecommunications and data storage corrects single-error bursts that are less than 11 bits long. The block length that the fire code covers is also limited in a way that is similar to, but more restrictive than, the limitation on CRCs for error detection.

Fire codes operate because of the careful selection of the CRC-generating polynomial. These special polynomials contain two prime-polynomial factors, each of which helps locate a detected error. Knowing the location of a transmitted error modulo the degree of each of these prime-polynomial factors, you can apply the Chinese remainder theorem to locate the exact bit error. Although the math sounds complex, you can easily implement it with high-speed D flip-flops and XOR gates.

Reed-Solomon block codes are popular in communications and data-storage applications. Like fire codes, Reed-Solomon-code implementations append symbols to the end of a transmission to locate and correct errors during decoding. Reed-Solomon-code systems' effectiveness at high data rates results from operations taking place at the code-symbol rate or at a fixed number of times per code word. Either way, the number of operations is much smaller than the number of bits. Chips that implement these types of high-speed real-time correctors are commercially available, as are DSP-software options.

Generally, a Reed-Solomon corrector provides a number of symbol corrections, say T, in an N-symbol code word. T is independent of the location of the errors inside the code word. When you use this method with complex interleaving, this approach can easily correct large error bursts. Further algorithm refinements allow even more correction capability if you know the error location by some other means. These "soft-error" indicators can come, for example, from up-stream decoding violations.

RAM subsystems often use Hamming codes. By using small code words, these codes offer simple decoding and correction at high speeds. Hamming codes typically ensure detection of multiple-bit errors in fixed-length strings.
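To show how little logic such a code needs, here is a sketch of the textbook Hamming (7,4) code, which corrects any single-bit error in a 7-bit code word. The choice of this particular code and the function names are assumptions for illustration; memory systems generally use longer code words, often with an extra overall parity bit so that double-bit errors are detected as well.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit code word (bit positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4        # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4        # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4        # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code_word):
    """Correct up to one bit error; return (data_bits, error_position)."""
    c = list(code_word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 0 means no error, else 1-based position
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the bit the syndrome points to
    return [c[2], c[4], c[5], c[6]], syndrome


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                          # single-bit error at position 5
print(hamming74_decode(word))         # ([1, 0, 1, 1], 5)
```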

CRC error detection computes the remainder of a polynomial division of a generator polynomial into a message. The remainder, which is usually 16 or 32 bits, is then appended to the message. When another remainder is computed, a nonzero value indicates an error. Depending on the generator polynomial's size, the process can fail in several ways, however. Use of the CRC technique for error correction normally requires the ability to send retransmission requests back to the data source.
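The polynomial division reduces to shifts and XORs. The sketch below assumes a 16-bit generator, x^16 + x^12 + x^5 + 1 (0x1021), with a zero initial register and no final inversion; real protocols often add preset values or a final XOR, so treat the parameters as illustrative.

```python
def crc16(data: bytes, poly: int = 0x1021) -> int:
    """Remainder of the message (shifted left 16 bits) divided by the generator."""
    register = 0
    for byte in data:
        register ^= byte << 8
        for _ in range(8):
            if register & 0x8000:
                register = ((register << 1) ^ poly) & 0xFFFF
            else:
                register = (register << 1) & 0xFFFF
    return register


message = b"digital channel"
sent = message + crc16(message).to_bytes(2, "big")   # append the remainder

# Receiver: recomputing over message plus remainder gives zero if undamaged.
print(crc16(sent) == 0)

damaged = bytearray(sent)
damaged[3] ^= 0x10                                   # flip one bit
print(crc16(bytes(damaged)) != 0)
```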

Interleaving is a simple way to spread dense errors that would otherwise overload code words that can correct small errors. Overloading means asking the code word to correct more errors than the chosen number of overhead symbols are designed to correct. Interleaving introduces latency, however, because the technique requires the availability of enough data to fill an interleave buffer before encoding or decoding can begin. Interleaving also imposes restrictions on data granularity. Data must be available in chunks that are at least the size of a full interleave buffer.

Communications and data-storage products use 2- and 3-D interleaving to correct error bursts of thousands of bytes in real time. For random erasure bursts with a bounded maximum burst length, a periodic interleaver is better. For longer bursts, even with sizable burst-length variations, a periodic interleaver matched to the nominal burst length is effective--provided that the duty cycle remains constant. When the nature of the interference is uncertain, random interleaving offers greater robustness.

Each of these error-correction approaches has overload characteristics that cause failures. For example, if a 32-bit fire-code-protected block contains two errors separated by exactly 11 bits, the fire code fails to correct the errors. A Reed-Solomon code word is unable to both locate and correct errors in a block that has more than T symbol errors. However, if the error locations are already known, the code word could make corrections.
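The Reed-Solomon limit can be stated compactly. For a code word with 2T check symbols, decoding succeeds when twice the number of errors at unknown locations plus the number of errors at known locations (erasures) does not exceed 2T. The helper below is only a sketch of that bound; its name and the 16-check-symbol example are assumptions.

```python
def rs_within_capability(check_symbols: int, errors: int, erasures: int = 0) -> bool:
    """True if 2 * errors + erasures fits within the code's check symbols."""
    return 2 * errors + erasures <= check_symbols


# Hypothetical code word with 16 check symbols, so T = 8.
print(rs_within_capability(16, errors=8))                 # True
print(rs_within_capability(16, errors=9))                 # False: overload
print(rs_within_capability(16, errors=0, erasures=16))    # True: locations known
```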

The location of errors with respect to each other and to symbol and block boundaries is important. To choose the correction strength or the interleave depth of advanced correctors, you must identify the number and correlation of random and burst errors. Even the distribution of burst events becomes important if the events correlate. An error-correction system that aims at withstanding the typical observed burst size will probably fail if one burst predicts another burst that falls within the same code-word interleave. An example is the correlated bursts that occur when a magnetic read head bounces off a large defect in a recording medium.

Analysis techniques used in designing and evaluating error-correction systems include measuring BER and burst-error rates, burst-length histograms, error autocorrelation, block-error distributions, and error-free-interval histograms. All of these analyses rely on knowing the exact bit location of a channel's errors.

Identifying the number of isolated bit and burst errors is an important first step. The best burst-error correctors can be ineffective with an overload of background errors. Isolated errors might be pattern-dependent, in which case signal equalization and coding might improve the BER. If all errors are isolated and random, there is no need to interleave the data, because the errors are already randomly distributed with respect to the correction-code words.

Determining burst errors depends on categorizing neighboring errors into one burst event. Grouping errors into bursts requires some tolerance for the number of error-free bits within a burst. Thus, common instruments allow users to set the maximum error-free-interval length in bits. For example, with fire codes, you should artificially set the maximum error-free interval to the data's blocking length. This setting ensures that two successive errors in a data block appear as one burst.
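The grouping rule can be sketched as follows, assuming you already have a list of error bit positions. The function name and the sample positions are illustrative; `max_error_free_interval` plays the role of the instrument setting described above.

```python
def group_bursts(error_positions, max_error_free_interval):
    """Group errors into bursts; return a list of (first_bit, last_bit) pairs.

    Two successive errors join the same burst when the run of error-free
    bits between them is no longer than max_error_free_interval.
    """
    bursts = []
    for position in sorted(error_positions):
        if bursts and position - bursts[-1][1] - 1 <= max_error_free_interval:
            bursts[-1][1] = position          # extend the current burst
        else:
            bursts.append([position, position])
    return [(first, last) for first, last in bursts]


errors = [10, 11, 13, 40, 41, 100]
print(group_bursts(errors, max_error_free_interval=2))
# [(10, 13), (40, 41), (100, 100)]
```

Raising the interval setting merges neighboring bursts, so the same error log yields different burst statistics for different settings.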

Once you identify the bursts, a histogram can show the probability distribution of different-length bursts. However, you can't use the histogram alone for designing error correctors, because you do not yet know where any two bursts are with respect to each other. Burst-length histograms typically reflect the underlying physics of the digital channel. Factors such as lightning-strike duration, magnetic-head shapes, and defect sizes contribute to the burst length. Figure 3 shows a histogram of burst lengths in a channel that already has a 1632-bit interleave and a very large minimum-error-free-interval setting. Set up this way, the histogram shows many bursts with lengths of 4000 to 5000 bits, indicating that raw, preinterleaved errors are five symbols long.
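A burst-length distribution takes only a few lines to compute. This sketch assumes that each burst is given as a (first_bit, last_bit) pair and that a burst's length is counted from its first error to its last, inclusive; the sample bursts are made up.

```python
from collections import Counter


def burst_length_distribution(bursts):
    """Map each burst length (in bits) to the fraction of bursts of that length."""
    counts = Counter(last - first + 1 for first, last in bursts)
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}


bursts = [(10, 13), (40, 41), (100, 100), (200, 201)]
print(burst_length_distribution(bursts))   # {1: 0.25, 2: 0.5, 4: 0.25}
```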

Error autocorrelation shows how errors correlate with each other. This analysis answers the question of whether one error predicts another error. The first part of an error autocorrelation looks similar to a burst-length histogram. The more interesting relationships occur farther from the initial error and help define interleave depths. For example, in Figure 4's sample autocorrelation from a data-storage channel, a high correlation exists between errors separated by 34,848 bits. This is due to defects inherent in magnetic-medium manufacturing. Process improvements are unlikely to remove these defects, so error correctors must deal with them. Interleave depths should separate errors by at least 34,848 bits to place these highly correlated error events in different code words.
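One simple way to compute an error autocorrelation from logged error positions is to count, for every separation up to some maximum lag, how many pairs of errors lie exactly that far apart. The sketch below takes that approach; the function name, the unnormalized counts, and the sample positions are assumptions, and an instrument may scale its results differently.

```python
from collections import Counter


def error_autocorrelation(error_positions, max_lag):
    """Count error pairs at each separation (lag, in bits) up to max_lag."""
    positions = sorted(set(error_positions))
    counts = Counter()
    for i, first in enumerate(positions):
        for later in positions[i + 1:]:
            lag = later - first
            if lag > max_lag:
                break                      # positions are sorted, so stop here
            counts[lag] += 1
    return counts


# Made-up log: pairs of errors 3 bits apart, recurring every 100 bits.
errors = [0, 3, 100, 103, 200, 203]
correlation = error_autocorrelation(errors, max_lag=150)
print(correlation.most_common(1))          # [(100, 4)]
```

A peak at a long lag, such as the one at 100 bits in this made-up log, is the kind of feature that tells you how far apart the interleaver must place symbols of the same code word.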

Block-error distributions are histograms of the number of errors that occur within a user-programmed block size independent of the errors' location within the block. This analysis is useful on position-independent correctors, such as Reed-Solomon correctors. A block-error analysis can help you choose the correction strength, T, to correct each block. The Reed-Solomon code does not correct block-histogram entries that are larger than the selected T value.
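A block-error distribution and the resulting overload count can be sketched as follows. Positions and block size are assumed to be in the same unit; to compare against a Reed-Solomon T, that unit should be code symbols rather than bits. The names and sample numbers are illustrative.

```python
from collections import Counter


def block_error_distribution(error_positions, block_size, total_length):
    """Histogram: number of errors in a block -> number of such blocks."""
    errors_in_block = Counter(pos // block_size for pos in set(error_positions))
    block_count = -(-total_length // block_size)       # ceiling division
    histogram = Counter(errors_in_block.values())
    histogram[0] = block_count - len(errors_in_block)  # error-free blocks
    return histogram


def overloaded_blocks(histogram, t):
    """Blocks holding more errors than a corrector of strength t can fix."""
    return sum(n for errors, n in histogram.items() if errors > t)


hist = block_error_distribution([1, 2, 3, 50, 120, 121],
                                block_size=40, total_length=160)
print(dict(hist))
print(overloaded_blocks(hist, t=2))   # 1
```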

Using commercial digital-channel-error analyzers, you can evaluate error-correction strategies by emulating interleave depths and correction operations. For an emulation to succeed, you must know the exact positions of the channel's error bits, and you must plot the errors in a table that mirrors the behavior of the blocking and interleaving. Once the table is full, you can check the limits of the corrector to evaluate which errors would be removed and which would cause overloads.
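The table-based emulation can be sketched in software under a simplified model. The assumptions here are a block interleaver of depth `depth`, in which channel symbol i of each frame belongs to code word i mod depth, and a corrector that removes every error from a code word holding T or fewer symbol errors and leaves overloaded code words untouched. Real decoders can also miscorrect an overloaded word, which this sketch ignores; all names and numbers are illustrative.

```python
from collections import defaultdict


def emulate_correction(error_symbols, codeword_len, depth, t):
    """Return the symbol-error positions that survive interleaved correction."""
    frame_len = codeword_len * depth
    table = defaultdict(list)              # (frame, code word) -> error positions
    for position in sorted(set(error_symbols)):
        frame, offset = divmod(position, frame_len)
        table[(frame, offset % depth)].append(position)
    residual = []
    for errors in table.values():
        if len(errors) > t:                # overload: the corrector gives up
            residual.extend(errors)
    return sorted(residual)


short_burst = [0, 1, 2, 3]                 # one error per code word
long_burst = list(range(40, 48))           # two errors per code word
raw = short_burst + long_burst
print(emulate_correction(raw, codeword_len=10, depth=4, t=1))
print(emulate_correction(raw, codeword_len=10, depth=4, t=2))   # []
```

Because the function returns error positions, you can feed its output back into the same burst, autocorrelation, and block analyses used on the raw channel.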

Error-correction emulation results in a modified stream of error locations that you can study in the same way you would study an uncorrected stream. You can then determine postcorrection error rates, burst lengths, and correlation. You can even apply additional levels of error correction.

Equipment to perform this type of analysis connects into systems the same way as do older BER testers. You can send pseudorandom data patterns or more complex formatted data streams through the analyzer's evaluation channels and route the channel outputs to the error analyzer. The analyzer synchronizes with the data stream and then locates error positions. You can also supply marker information that determines the blocking factors, which the analysis uses.

You can often interface a digital-channel-error analyzer to a system on which traditional BER testers have proven impractical. Factors that make BER testers difficult to apply include the testers' use of bit-serial interfaces and reliance on pseudorandom data patterns. New digital-channel-error analyzers offer serial and parallel interfaces with user-programmable patterns. Patterns can include sections for equalization training, PLL detection, synchronization, and more.

Analysis can occur either in real time or as a postprocessing step after logging error data. Postprocessing of error-data sets allows you to evaluate the effect of different correction interleaves and strengths on the same raw-error data. This apples-to-apples comparison holds the raw-channel errors constant and varies only the correction parameters.

When you design modern error-correction systems, the location of errors and their relationships to each other and to the symbol or code-word boundaries are more important than the BER. Analyzers that study error locations are as easy to use as earlier BER testers and provide insight into a digital channel's behavior.


 

Author's biography

Tom Waschura is a principal engineer with SyntheSys Research (Menlo Park, CA). His interest is in digital-channel design and analysis, in which he holds several pending and assigned patents.




Copyright © 1997 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Publishing Company, a unit of Reed Elsevier Inc.