EDN Access -- 08.17.95 When computers must not fail . . .


Cover Story: August 17, 1995

When computers must not fail . . .

Dan Strassberg,
Senior Technical Editor

When a computer failure can take down an industrial process or a whole company, computer systems that keep running command high prices. Newer software and hardware are cutting the price premium, though--and none too soon; demand is exploding for computers that just won't quit.

Industrial Issue

Some computer systems simply must not fail. More and more, computers that handle real-time industrial applications have to stay on-line at all times. At present, the Holy Grail of computer-system uptime is "five nines" (99.999%) availability--about five minutes of downtime per year. Because not all mission-critical applications need or can afford such low downtime, many ways to increase uptime have emerged. Although most schemes use special hardware, software figures at least as strongly in preventing and recovering from system crashes. Within certain limits, software now makes it possible to build fault-tolerant systems from standard workstations and PCs.


Suppliers of so-called fault-tolerant or continuous-availability products like to focus on uptime. Although this orientation puts a positive spin on downtime, it can obscure major differences among products. Availability of 99.9% sounds a lot like 99.99%. But a system that delivers 99.99% availability has only one-tenth the downtime of one that offers 99.9% (roughly 53 minutes per year vs almost nine hours) and usually commands a higher price. A less expensive class of products goes by the name "high availability." Few products that go by this name have been around long enough to establish track records for reliability. Thus, in the high-availability area, vendors focus on features that enhance systems' reliability.
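The arithmetic behind these availability figures is easy to check. The following is a minimal sketch in Python; the function name is illustrative, and the calculation assumes a 365-day year of continuous operation.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent):
    """Return the downtime, in minutes per year, implied by an availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for pct in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% availability: {minutes:.1f} minutes of downtime per year")
```

Each added "nine" cuts the permitted downtime by a factor of 10, which is why availability figures that look nearly identical can describe very different products.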

Manufacturers of hardware-based fault-tolerant systems (IMP, Sequoia, Stratus, and Tandem, for example) do quantify uptime. These companies have supplied continuous-availability products for at least a decade. These systems' primary applications are in telecommunications and on-line transaction processing (OLTP). Airline-reservation and brokerage systems are two examples. Prices start at under $100,000, although configurations that cost from $200,000 to $500,000 are more common. Before paying such prices, customers must calculate a payback, and documented reliability is essential for the calculations.

However, you don't need to spend $100,000--or even half as much--to obtain reliability far greater than that of standard workstations and PCs. Several vendors of industrial PCs and VMEbus products (see box, "For free information...") offer systems that duplicate the hardware elements that are most likely to fail--power supplies, hard disks, and blowers. Some systems also continuously monitor key indicators of impending failure, such as power-supply voltages, airflow, air temperature, and fan rotation. These systems warn operators of conditions that can foretell failures. Thus, high-availability industrial PCs provide a measure of fault tolerance at a fraction of the cost of true continuous-availability systems.
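The monitoring idea can be sketched as a simple limit check. The sensor names and limits below are hypothetical, chosen only for illustration; real values depend on the hardware, and none of them comes from a vendor's specification.

```python
# Hypothetical limits for illustration only.
LIMITS = {
    "supply_5v": (4.75, 5.25),    # volts
    "air_temp_c": (0.0, 50.0),    # degrees Celsius
    "fan_rpm": (2000.0, 6000.0),  # revolutions per minute
}


def check_readings(readings, limits=LIMITS):
    """Return warnings for readings that are missing or out of range."""
    warnings = []
    for name, (low, high) in limits.items():
        value = readings.get(name)
        if value is None:
            warnings.append(f"{name}: no reading")
        elif not low <= value <= high:
            warnings.append(f"{name}: {value} outside {low}..{high}")
    return warnings


# A slowing fan is flagged before it stops altogether.
print(check_readings({"supply_5v": 5.02, "air_temp_c": 31.0, "fan_rpm": 900.0}))
```

Treating a missing reading as a warning is a deliberate choice in this sketch: a sensor that stops reporting may itself signal an impending failure.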


Picture Two

Hot swapping of key system elements is another feature of some of these PC and VMEbus products. A hot-swappable system element is one that you can unplug or install without powering the system down and, ideally, without disturbing system operation. In addition to many hard-disk drives used in RAIDs (Ref 1), the components that are currently farthest along in hot-swappability are power supplies and fans.

Hot-swappable I/O controllers and CPUs--standard features of true fault-tolerant systems--are virtually nonexistent in PCs. You can buy VMEbus products that permit hot swapping, however. Currently, you achieve hot swapping of VME boards by using interposers--short extenders that fit between standard VME boards and the backplane. The VME-64 bus, whose specification is currently under review pending ratification, will eliminate interposers.

VME-64 provides hot swappability as a standard feature, thanks in part to special five-row DIN connectors supplied by Harting Elektronik. The connectors assure that ground connections are the first made and the last broken when you insert or remove a board. Specially developed bus-interface ICs safeguard against "glitching" the bus with transients when you insert or remove a board. The ICs, currently available from Texas Instruments as the ETL series, will also be available from National Semiconductor.

Another feature of these ICs is incident-wave switching. This feature increases the bus speed by assuring that signals cross logic thresholds immediately as they propagate down the bus. Older technology required a signal reflection from the end of the bus to bring the voltage level above the threshold.

Despite their interest in high-availability computing, some manufacturers of industrial PCs and VMEbus products question how soon (or even whether) the documented low downtime of continuously available OLTP systems will come to industrial applications. The reason is that typical industrial systems dedicate a high percentage of their hardware to handling real-time I/O.

Although OLTP systems also handle lots of I/O, the nature of the I/O devices is different. Operator terminals are the most common I/O devices in OLTP. In industrial applications, contact closures, ADCs, and DACs are the most common. Whereas duplicating these system elements is technically feasible, wiring complexities and cost have deterred manufacturers from offering redundant real-time I/O hardware.


Not everyone thinks fault-tolerant computing is impractical for areas such as factory automation, however. For example, Isis Distributed Systems supplies software tools for creating fault-tolerant applications that run on networks of standard workstations. The company points with pride to a new Advanced Micro Devices wafer-fab facility as a showcase for the advantages of using Isis' tools to develop manufacturing-control applications.

Saying that software plays a key role in system availability is a clear understatement. Whereas every high- and continuous-availability system depends on redundant hardware in some form, software problems can bring the most sophisticated hardware to a dead stop.

Although vendors of fault-tolerant hardware address many potential software problems in their operating-system (OS) software, a new type of software tool aims at creating systems that survive an even wider range of faults. These tools create continuously available systems by networking standard workstations and PCs. Because such networks use mass-produced hardware, they can be less expensive than systems based on specialized hardware. (The networks are more expensive than non-fault-tolerant and even high-availability approaches, however.) In general, too, these networks take longer to recover from faults than systems based on hardware fault tolerance.

As with more conventional fault-tolerant systems (ones with tightly coupled redundant hardware elements), the objective of fault-tolerant networks is to eliminate single points of failure (SPFs). When viewed in the proper light, groups of standard workstations running specialized software that synchronizes communication among the group members contain no SPFs.

"Synchronizes" isn't a totally accurate term because fault-tolerant networks don't require messages to arrive at all destinations simultaneously. Synchrony in a fault-tolerant group of networked computers requires that no group member act on any message until it knows that the message has reached all members. Each group member must acknowledge to its peers within a specified period that it has received a message. If group members can't confirm that some member has received a message, they isolate the suspect member, divide its tasks among themselves, and roll back the image of system activity that each maintains to what existed when all group members last functioned normally.
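A toy model in Python may make the rule concrete. It is a sketch under stated assumptions, not the protocol of any product: the class and method names are invented, the timeout is reduced to a set of members whose acknowledgments arrived in time, and orphaned tasks are divided round-robin. Because no member acts on an unconfirmed message, "rolling back" here amounts to leaving the agreed log untouched.

```python
class Group:
    """Toy model of a message-synchronized group (all names illustrative)."""

    def __init__(self, tasks_by_member):
        self.tasks = {m: list(t) for m, t in tasks_by_member.items()}
        self.agreed_log = []  # messages known to have reached every member

    def send(self, message, acked_in_time):
        """Act on `message` only if every current member acknowledged it."""
        suspects = set(self.tasks) - set(acked_in_time)
        if not suspects:
            self.agreed_log.append(message)
            return True
        if len(suspects) == len(self.tasks):
            raise RuntimeError("no surviving members")
        # Isolate the suspect members and collect their tasks.
        orphaned = []
        for name in sorted(suspects):
            orphaned.extend(self.tasks.pop(name))
        # Divide the orphaned tasks round-robin among the survivors.
        survivors = sorted(self.tasks)
        for i, task in enumerate(orphaned):
            self.tasks[survivors[i % len(survivors)]].append(task)
        # agreed_log is unchanged: it still reflects the last agreed state.
        return False


group = Group({"a": ["t1"], "b": ["t2"], "c": ["t3", "t4"]})
group.send("m1", {"a", "b", "c"})  # all acknowledge: members may act
group.send("m2", {"a", "b"})       # "c" is silent: isolate it, reassign its tasks
print(group.tasks, group.agreed_log)
```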

In theory, the group members can be widely dispersed geographically; they might even communicate via the Internet. (A group linked by the Internet would not operate in anything even remotely resembling real time, however.) In practice, the group members are rarely more than a few miles apart. Even then, system architects make sure that group members share as few system elements as possible. For example, group members receive ac power from different ac mains (though usually from the same power company).

Many factors determine how rapidly such a networked group can recover (that is, resume more or less normal operation, even if not at full speed) from the failure of a member (where "member" includes a computer and its associated communications facilities). The speed of communication is especially important. Generally, recovery takes from several seconds to several minutes. Of greater importance, though, is that recovery--in the sense defined here--requires no human intervention.

In contrast, systems that base their fault tolerance mainly on specialized hardware often recover in nanoseconds. However, the types of failures from which hardware-based fault-tolerant systems can recover are somewhat limited. Hardware-based fault tolerance facilitates recovery from deterministic failures; fault tolerance based on message synchrony permits recovery from any kind of failure. Although deterministic problems include many software faults, certain software faults can bring down systems whose fault tolerance depends primarily on hardware. Computer scientists have named these faults Heisenbugs, after the Heisenberg uncertainty principle.


Heisenbugs are elusive faults that appear following particular, seemingly innocuous sequences of events. Because such faults are unpredictable, software engineers cannot eliminate them by design. One way that message-synchronized fault-tolerant networks keep nondeterministic bugs from causing serious problems is by dynamically allocating tasks among group members. The networks allocate tasks in a way that prevents the members from doing quite the same thing at the same time. For true fault tolerance, group members still must check each other's calculations. If results don't agree, the group must determine which result is correct and remove the member that caused the error from service.

Although the concepts of message-synchronized fault-tolerant networks are over a decade old, the networks were impractical until economical workstations that process multiple MIPS became available. Maintaining message synchrony requires a lot of computing power--one estimate is about 0.5 MIPS per computer. Unless each of the networked computers can process significantly more than 0.5 MIPS, inadequate power remains for the real computing tasks.
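A rough calculation shows why processor speed matters. The sketch below assumes that the synchronization overhead is a fixed 0.5 MIPS per computer regardless of machine speed; that is a simplification, because the article quotes only the single estimate.

```python
SYNC_OVERHEAD_MIPS = 0.5  # the estimate quoted above


def usable_fraction(total_mips, overhead=SYNC_OVERHEAD_MIPS):
    """Fraction of a computer's capacity left for the real computing tasks."""
    return max(0.0, (total_mips - overhead) / total_mips)


for mips in (0.5, 1.0, 10.0, 50.0):
    print(f"{mips} MIPS machine: {usable_fraction(mips):.0%} left for the application")
```

Under this assumption, a 1-MIPS machine gives up half its capacity to synchronization, whereas a 50-MIPS workstation gives up only 1%.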

Isis Distributed Systems provides so-called middleware, which allows application developers to design fault-tolerant networked (client-server) applications with little concern for the impact of fault tolerance on their code. The Isis System Developers' Kits contain libraries of routines that developers bind into their code. These routines handle message synchronization across the network. Development licenses start at $3000; runtime licenses start at $400.

Hardware-based fault-tolerant computing systems are among the most highly sophisticated and impressive computers that exist. Hardware developers are starting to ask whether the disease that these elegant products cure is becoming so rare that the cure is superfluous, however. With or without fault-tolerant designs, the reliability of computer hardware continues to improve steadily. Still, it's hard to imagine a responsible developer of a mission-critical application entrusting such an application to a single conventional workstation or PC.

In general, systems based on hardware fault tolerance include at least two of every major element. Surprisingly, though, some fault-tolerant systems use fewer than twice the number of hard-disk drives that an otherwise-equivalent non-fault-tolerant system would use. RAIDs that contain five or nine drives (vs four or eight for equal-capacity non-fault-tolerant configurations) are examples. Arrays that include redundant drives can correct errors caused by a failure of any single drive. Similarly, some systems use n+1-redundant power supplies. For example, if two supplies can provide all of the system's power needs, the system includes three units.
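One common way to obtain this single-drive protection with just one extra drive is XOR parity. The article does not say which scheme any particular product uses, so the sketch below is an illustration of the principle only: four data "drives" plus one parity "drive," with each drive reduced to a short byte string.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"EDN1", b"EDN2", b"EDN3", b"EDN4"]  # four data drives
parity = xor_blocks(data)                    # the fifth, redundant drive

# Suppose the third data drive fails. XORing the surviving drives with the
# parity drive regenerates the missing contents.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == data[2])
```

The same reasoning applies to n+1 power supplies: one spare unit covers the loss of any single unit, so full duplication is unnecessary.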

Although all true fault-tolerant computers contain at least two CPUs, most contain more. When there are just two CPUs, both make the same calculations and compare the results. If the results match, the system assumes all is well, and the work proceeds. If the results don't match, diagnostic routines determine which CPU has malfunctioned. IMP uses this approach. In IMP's implementation, the diagnostics have 100 msec to report a result. If a CPU fails to report its status in that time, the surviving CPU takes it off-line.

LOOKING AHEAD

Although Europe now accounts for some 90% of all smart cards in use, the situation is likely to change. Credit-card giant Visa is starting to issue smart cards in the United States, and Master Card is teaming with Visa in an Internet venture that could provide an expanded role for smart cards.

Factors as seemingly unrelated as just-in-time manufacturing and the declining cost of electronic hardware are coming together to increase the importance and the practicality of nonstop computing. When customers depend on uninterrupted supplies of components and materials, the computers that control the production of those items had better not fail. At the same time, even nonstop-computing approaches that derive their fault tolerance from software use lots of powerful hardware. The declining cost of that hardware is bringing nonstop computing within reach of the companies that need it.

The digital-communications revolution also places increasing emphasis on nonstop computing. The distinction between computers and communication systems is already quite blurry. Whether the communication system is a central office or a fax server on a network, downtime is anathema. Computer-industry observers expect a new class of computer system--the communications computer--to emerge. Fault tolerance will be a hallmark of such systems.

IMP systems can also contain three CPUs. When one disagrees with the other two, the CPU that disagrees is assumed to be in error. Some systems from Tandem use a similar voting scheme. IMP's three-processor systems offer an additional advantage: If one of the three CPUs fails, the remaining two CPUs continue to provide fault tolerance. Although all true fault-tolerant systems continue operating after a failure, many cannot operate after a second failure until at least one of the failures is repaired.
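The difference between comparing two results and voting among three can be sketched in a few lines of Python. The function below is illustrative and does not represent any vendor's logic: it returns the majority result and the set of CPUs that disagree with it, or no result at all when there is no strict majority.

```python
from collections import Counter


def vote(results):
    """Return (agreed_value, dissenters) for a dict of CPU name -> result.

    agreed_value is None when no strict majority exists; in that case every
    CPU is a suspect and diagnostics must decide which one is at fault.
    """
    counts = Counter(results.values())
    value, votes = counts.most_common(1)[0]
    if votes * 2 <= len(results):
        return None, set(results)
    return value, {cpu for cpu, r in results.items() if r != value}


print(vote({"cpu0": 42, "cpu1": 42, "cpu2": 41}))  # cpu2 is outvoted
print(vote({"cpu0": 42, "cpu1": 41}))              # two CPUs: mismatch only
```

With two CPUs, a mismatch reveals that something is wrong but not which CPU is wrong; with three, the vote itself identifies the suspect.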

Systems from Sequoia use two CPUs per computing element, but the number of computing elements varies. Sequoia's system software dynamically allocates tasks to the computing elements. If the two CPUs in an element disagree, the system takes the offending element off-line, redistributes the tasks among the remaining elements, and rolls back all elements' system-activity records to the point before the fault occurred. In systems that contain three or more computing elements, Sequoia's approach, like IMP's, results in continued fault tolerance despite multiple failures. Themis uses a similar approach. Some Themis systems run applications built with the aid of Isis tools.
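The rollback step can be pictured as restoring a saved copy of the activity record. The sketch below is a generic checkpoint-and-rollback illustration with invented names; it makes no claim about how Sequoia's or Themis' software stores or restores its records.

```python
import copy


class CheckpointedState:
    """Keeps a working state plus the last copy known to be good."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def commit(self):
        """Record the current state as known good."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Discard all work done since the last commit."""
        self.state = copy.deepcopy(self._checkpoint)


element = CheckpointedState({"orders_processed": 10})
element.state["orders_processed"] = 11
element.commit()                          # all CPUs agreed at this point
element.state["orders_processed"] = 99    # work done after a fault crept in
element.rollback()
print(element.state)
```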

Stratus systems also use two CPUs per computing element. Many Stratus CPUs contain multiple µPs, however. Systems contain two computing elements. The pair of CPUs within a computing element and the pair of computing elements in the system all run in lock step. (In other words, four CPUs run in lock step.) If the two CPUs in a computing element disagree, the computing element goes off-line before the error can propagate beyond the computing element. Hence, there is rarely a need to roll back the image of system activity, and there is no loss of computational power.


Attention to detail

Although the following discussion emphasizes features of Stratus' Continuum systems, the goal is not the promotion of Stratus over its competitors. The discussion appears because continuous-availability systems represent the ultimate in commercial-product reliability. Although other vendors' systems differ from Stratus' in detail and even in major elements of design philosophy, the attention to detail is common to all. In your own application, you must decide how closely you need to approach the reliability of continuous-availability systems. Based on that appreciation, you should be able to decide how many of the concepts of continuous-availability systems your product must incorporate.

Fault-tolerant systems duplicate much more than the CPUs and computing elements. Other duplicate elements include memory and I/O buses, communication links, cooling fans, and peripheral controllers. (Note that using a RAID with a single disk controller does not safeguard against a controller failure.) Some vendors duplicate backplanes, although Stratus doesn't. All data transferred over buses include error-correction codes.

Stratus systems use mirrored hard-disk arrays (RAID level 1) but use only one mirrored-array pair per system. To achieve high transfer rates, the arrays use data striping. (That is, they divide the data accessed in single reads and writes among multiple drives that read and write simultaneously.) The drives use error correction internally, but the systems do not duplicate drives within either half of a mirrored-array pair. (In RAID terminology, the level-1 pair comprises two level-0 arrays.)
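The combination of striping and mirroring can be illustrated with byte strings standing in for drives. This is a sketch of the general technique only; the chunk size, drive count, and function names are arbitrary choices and do not describe Stratus' implementation.

```python
def stripe(data, n_drives, chunk=4):
    """Distribute `data` round-robin, `chunk` bytes at a time (RAID level 0)."""
    drives = [bytearray() for _ in range(n_drives)]
    for i in range(0, len(data), chunk):
        drives[(i // chunk) % n_drives] += data[i:i + chunk]
    return [bytes(d) for d in drives]


def unstripe(drives, chunk=4):
    """Reassemble data written by stripe() with the same chunk size."""
    out = bytearray()
    offsets = [0] * len(drives)
    turn = 0
    while any(offsets[k] < len(drives[k]) for k in range(len(drives))):
        k = turn % len(drives)
        out += drives[k][offsets[k]:offsets[k] + chunk]
        offsets[k] += chunk
        turn += 1
    return bytes(out)


payload = b"When computers must not fail"
half_a = stripe(payload, 4)   # one striped array
half_b = list(half_a)         # its mirror (RAID level 1 over two level-0 arrays)

half_a[1] = None              # a drive in one half fails
recovered = unstripe(half_b)  # the other half still holds everything
print(recovered == payload)
```

Neither half can survive the loss of one of its own drives; the protection comes entirely from the mirror.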

Besides constantly comparing the results of computations, Stratus systems continually execute diagnostic routines as background tasks. The routines are hard-coded into the CPU-board logic and, thus, do not directly involve the CPUs. The operating system is oblivious to the diagnostics; hence, the diagnostics don't cause Heisenbugs.

When a failure occurs, the system element that detects the failure

At the factory or field repair center, the diagnostic message aids the repair technician. The message also remains inside the failed module, even after repair, providing a lifelong indication of the unit's repair history.


The IEEE-1149.1 boundary-scan-testability standard figures heavily in Stratus' hardware architecture. For example, the system writes its diagnostic messages into nonvolatile RAM through the RAM chips' IEEE-1149.1 test-access ports (TAPs). Stratus design standards require that all digital ICs the company uses incorporate IEEE-1149.1. Designs that use noncompliant devices require the approval of a Stratus vice president.


Avoiding human error

Human error can defeat the best fault-tolerant design. Because most repairs to fault-tolerant systems are made by customer personnel while the systems remain on-line, Stratus and other companies use a series of red, yellow, and green LEDs to indicate module status. The LEDs guide relatively unskilled operators to replace the correct modules and, more important, not to remove modules that must remain in place. A green light signifies that you can remove a module. Pulling a module that displays a red light will cause the system to fail. A flashing yellow light indicates a module that has gone off-line because of a problem. A module that displays both green and yellow lights is, in effect, warming up.

Warm-up consists of the following steps: When you plug a module in, it first runs a group of self tests. Then, it synchronizes with the system clocks and copies the contents of all memory and registers from its counterpart. Copying goes on in the background while the counterpart module remains on-line. The copying process can take several minutes, during which the system performance may degrade somewhat.
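The warm-up sequence amounts to a short series of steps, each of which must succeed before the next begins. In the sketch below, the step names and the `run_step` callback are hypothetical stand-ins for the module's actual self tests, clock synchronization, and background copy.

```python
WARM_UP_STEPS = ("self_test", "sync_clocks", "copy_state")


def warm_up(run_step):
    """Run the warm-up steps in order.

    `run_step(name)` returns True on success. The result is "online" if every
    step passes, otherwise "failed:<step>" for the first step that fails.
    """
    for step in WARM_UP_STEPS:
        if not run_step(step):
            return f"failed:{step}"
    return "online"


print(warm_up(lambda step: True))
print(warm_up(lambda step: step != "sync_clocks"))
```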

Stratus systems use a distributed power architecture. Redundant bulk supplies with battery backup deliver loosely regulated 48V dc to the modules, each of which includes a pair of dc/dc converters. The bulk supplies receive power from separate ac mains through separate circuit breakers and power switches. The separate switches prevent operators from taking a system off-line by inadvertently powering it down. Within the modules, each CPU in a computing element has its own dc/dc converter.

Like systems that offer mere high availability, Stratus systems monitor reliability-related functions, such as airflow, temperature, and supply voltages. Unlike some high-availability systems, though, these continuous-availability systems do not involve the main processors in such mundane tasks. Instead, dedicated maintenance processors handle these low-level chores.

Some of Stratus' hot-swapping features resemble those of VME-64. When you insert or remove modules, special connectors assure that the ground connections are the first made and the last broken. Bus drivers remain in a high-output-impedance state until a module is ready to transmit data. Line receivers do not accept data from the bus until the module has successfully run diagnostics. Shorting adjacent backplane pins does not affect system operation. If you observe the red lights, the only effect of inserting and removing modules is a slight throughput degradation as replacement modules copy information from modules that are on-line.


Contain those faults

Stratus hardware emphasizes fault containment--keeping faults from propagating beyond the element in which they first appear. So, too, does the company's OS software. The company's original systems used a proprietary OS, VOS, which Stratus still supplies. However, application developers are generally more familiar with Unix. Many more applications run under Unix than under VOS, so the company now also offers FTX, its own version of Unix System V release 4 for multiprocessing (SVR4 MP). Sequoia and Tandem also offer OSs that run unmodified Unix applications. IMP and Themis use a standard OS--Sun Microsystems' Solaris, which is Unix-compatible.


Except for Solaris, these multitasking OSs incorporate hardened kernels, which are designed to prevent applications from causing system panics. (A "panic" is the Unix term for a crash.) The first objective is to contain each problem within the application in which it arose and not to let the problem affect other applications running concurrently. The system vendors have also rewritten key parts of Unix to prevent known problems. For example, Unix is known to panic when applications ask it to allocate more resources than a system can offer.

IMP adds fault tolerance to Solaris by using hardened device drivers. The company publishes the rules for writing these drivers so that peripheral-device suppliers can also write them. This approach is an important element of IMP's marketing strategy. The company markets its systems through telecommunications OEMs that incorporate the systems into products such as message switches and private branch exchanges (PBXs). Such systems include substantial amounts of specialized hardware, for which the communications OEMs must supply custom drivers.

By now, you should understand that fault tolerance, whether provided by software on a network of standard workstations or by a combination of software and special hardware, offers benefits even in situations unrelated to faults. Upgrades to fault-tolerant systems' hardware and software need not cause service interruptions. You take the element to be upgraded off-line, perform the upgrade, and then put the element back on-line.

However, when an upgrade involves stored information, you may have to wait quite a while before the upgrade is really complete. When you load new software onto a mirrored drive, many hours or even days can pass before the image of the software on the counterpart disk becomes identical to that on the upgraded drive. The system must continue to function--albeit with degraded or nonexistent redundancy--while it synchronizes the drives' contents in the background. Depending on the quantity of data to be copied and the allowable degree of performance degradation, copying can take a substantial time.
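A back-of-the-envelope estimate shows how the two factors interact. The figures in the sketch below (20 Gbytes of data, a 5-Mbyte/sec drive, and 10% of the drive's bandwidth reserved for background copying) are hypothetical and were chosen only to illustrate the calculation.

```python
def resync_hours(data_gbytes, disk_mbytes_per_s, share_for_copy):
    """Hours needed to copy the data in the background.

    `share_for_copy` is the fraction (0 to 1) of disk bandwidth that the
    system can devote to copying without degrading performance too far.
    """
    if not 0 < share_for_copy <= 1:
        raise ValueError("share_for_copy must be in the range (0, 1]")
    seconds = data_gbytes * 1000 / (disk_mbytes_per_s * share_for_copy)
    return seconds / 3600


print(f"{resync_hours(20, 5, 0.1):.1f} hours")
```

Halving the bandwidth reserved for copying doubles the time during which the system runs without full redundancy.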


Dan Strassberg
You can reach Senior Technical Editor Dan Strassberg at (617) 558-4205, fax (617) 928-4205, or on the Internet at ednstrassberg@cahners.com.


Reference

1. Travis, Bill, "SCSI-based RAID systems provide storage redundancy, enhance data-transfer rate," EDN, Nov 23, 1994, pg 81.


Suppliers of mission-critical computing products
When you contact any of the following manufacturers directly, please let them know you read about their products at the EDN Magazine WWW site.
Diversified Technology; Ridgeland, MS; (601) 856-4121
Force Computers Inc; San Jose, CA; (408) 369-6000
Harting Elektronik Inc; Hoffman Estates, IL; (708) 519-7700
I-Bus; San Diego, CA; (619) 974-8400
IMP (Integrated Micro Products); Dallas, TX; (214) 980-2771
Isis Distributed Systems Inc; Marlborough, MA; (508) 460-2430
Micro Alliance Inc; Vista, CA; (619) 598-1900
Micro Linear Corp; San Jose, CA; (408) 433-5200
National Semiconductor Corp; Santa Clara, CA; (800) 628-7364
Radstone Technology; Montvale, NJ; (201) 391-2700
Sequoia Systems Inc; Marlborough, MA; (508) 480-0800
Stratus Computer Inc; Marlborough, MA; (508) 460-2000
Tandem Computers Inc; Cupertino, CA; (408) 285-6000
Texas Instruments Inc; Denver, CO; (800) 477-8924
Texas Microsystems Inc; Houston, TX; (713) 541-8200
Themis Computer; Fremont, CA; (510) 252-0870
Vero Electronics Inc; Hamden, CT; (203) 288-8001




Copyright © 1995 EDN Magazine. EDN is a registered trademark of Reed Properties Inc, used under license.