High-availability Internet servers: Linux clustering on a CompactPCI platform

Customer demand for increased bandwidth with minimum downtime has forced network service providers to specify server systems with high-availability features. CompactPCI hardware supporting multiprocessor Linux clusters is one way to keep up with this demand.

Harald Mueller and Russell Scott, PEP Modular Computers -- EDN, 12/7/2000

Rapid growth in the use of the Internet and other communication technologies has created a tremendous challenge for both ISPs (Internet service providers) and companies that operate Internet servers. According to recent estimates, ISPs need to double their performance capabilities every 100 days to meet the exploding demand. The 24-hours-a-day, seven-days-a-week nature of Internet access means that you must develop server systems to provide maximum availability—that is, with little or no downtime.

Internet businesses that simply expand the number or capacity of current servers are likely to encounter significant space and cost constraints in the near future. Thus, system designers must develop new options for this market that maximize performance while reducing system size, both to fulfill the anticipated performance needs and to help their users remain competitive.

Running Linux cluster systems on CompactPCI platforms has many advantages and is one way to meet the challenges of growing demand. The use of clustering can provide high availability, and Linux is playing an increasing role in clustering options.

The need for high availability

Businesses have traditionally used high-availability systems to provide reliable operation for critical corporate applications and industrial process monitoring and control. Customers' increasing requirements for reliability are making high-availability systems an economic necessity in the telecommunications and Internet industries. However, almost all businesses that depend on information technology (particularly those that must run with little or no downtime) would benefit from an approach based on high availability, because systems are becoming increasingly complex and the cost of system failure and data loss is potentially devastating.

An established method for businesses to evaluate the required level of system reliability is to estimate the total annual unplanned outages. Table 1 correlates levels of availability with the projected minutes per year of downtime.

Telecommunications systems usually require 99.999% availability, whereas for other applications, lower availability levels may be adequate. Because increasing system availability is expensive, you must weigh the cost of incremental improvements against the potential cost of downtime. However, experience indicates that most businesses currently operate systems with an insufficient, rather than excessive, level of system availability.
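The relationship between an availability level and yearly downtime is simple arithmetic, and a short sketch may make the trade-off easier to weigh. The function name and the chosen levels below are illustrative assumptions, and the calculation ignores leap years; it is not a reproduction of any published table.

```python
# Illustrative sketch: convert an availability percentage into the
# expected unplanned downtime per year. Names are hypothetical.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime implied by an availability level."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for level in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(level)
        print(f"{level}% available -> {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, the 99.999% level that telecommunications systems usually require leaves only a little more than five minutes of unplanned outage per year, whereas 99% allows more than three and a half days.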

High availability and clustering

High availability is not a function only of a system's hardware and software; it is also influenced by how humans operate the system and by the environmental conditions under which the system operates. Because overall system availability is based on the reliability of a system's subsystems and components, the most commonly used method to assess and improve reliability is identifying and providing redundancy for SPOF (single points of failure). Various industries have demonstrated the value of this technique for decades, especially the aerospace industry.
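A rough calculation can show why removing a single point of failure pays off. The sketch below assumes that component failures are independent, which real systems only approximate, and the component availabilities are invented for illustration.

```python
# Illustrative sketch: availability of components in series and of
# redundant copies, assuming independent failures. All figures are
# hypothetical.
from math import prod


def series_availability(parts):
    """Every part is a single point of failure: availabilities multiply."""
    return prod(parts)


def redundant_availability(availability, copies):
    """The function survives as long as at least one copy is working."""
    return 1.0 - (1.0 - availability) ** copies


if __name__ == "__main__":
    cpu, power_supply, disk = 0.999, 0.995, 0.99  # assumed values
    print("no redundancy:", series_availability([cpu, power_supply, disk]))
    mirrored_disk = redundant_availability(disk, copies=2)
    print("mirrored disk:", series_availability([cpu, power_supply, mirrored_disk]))
```

In this hypothetical system, the weakest component dominates the result, so duplicating it yields the largest gain for the cost.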

In the traditional approach to high availability, you duplicate critical components and hold them in some form of "hot standby;" the backup components assume critical functions when the primary components fail. You must also incorporate hardware and software support into the system to enable the backup components to step in when necessary.

Almost all server components can represent SPOF, from the CPU to the power supply, the network interface cards, the storage drives, and so on. System designers may provide redundancy for any or all of these functions, depending on the server's application environment. RAID is the most popular form of redundancy in single servers, primarily those that manage internal networks. In its mirrored form, this approach duplicates data from the primary storage disks onto a second set of disks, enabling the system to maintain data integrity during a disk failure.

The traditional approach suffers from two serious problems, particularly when you consider the explosive growth in required system capabilities anticipated for Internet businesses. First, it is wasteful: Usually, the resources held in hot standby are as costly as those they duplicate, and most of the time, they are essentially unused. Second, for certain components (such as CPUs) in the PC environment, implementing the "hot-standby backup" approach remains difficult or impossible.

For these reasons, an increasingly popular approach to high availability is to link fail-safety with some form of load sharing among resources. Companies have developed a number of system architectures—generally known as clusters—that are based on even distribution of load across participating systems.

You can define "cluster" as a grouping of nodes that you use in combination to complete a task. The term most commonly refers to a set of PCs, but you can also apply it to one computer with multiple (hundreds or even thousands of) processors. The keys to clustering are that the individual nodes are interconnected and that they communicate via specialized connections, such as a SCSI or a Fast Ethernet link.

You can use the cluster approach for a variety of purposes. To begin with, clustering can help provide the state-of-the-art computing power that you need for applications such as weather forecasting. These applications are characterized by extremely high computing throughput for each individual node or processor combined with extremely fast interconnections between nodes.

The cluster approach can also help with load distribution and balancing, which load-critical applications, such as large databases and Web servers, require. Systems based on this approach provide performance that is difficult or impossible to obtain with single computers. Multiple processors share the load in such a way that the cluster appears to outside users as one computer. These systems are characterized by high-performance I/O units, such as hard disks and network interfaces.

Finally, the cluster approach provides high availability and fail safety, particularly for mission-critical applications, such as nuclear-power plants and telecommunication services. In this approach, the failure of one or more nodes does not lead to failure of the overall cluster. Depending on the system configuration, a failure may result in a reduced level of service but preserve critical functions. The load-sharing features of such clusters permit multiple levels of service, based on the nature and extent of the failure. The nature of applications that a given cluster serves may make combining two or all three of these approaches advantageous.

Figure 1 shows a typical two-node cluster configuration. The nodes share disk resources through a SCSI connection and communicate with each other via Ethernet. They also share a "heartbeat" RS-232 serial-port connection, which is not used for information processing.
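The heartbeat idea can be sketched in a few lines: each node periodically signals that it is alive, and its partner treats prolonged silence as a failure. The class below is a minimal illustration only. The class name, the three-second timeout, and the injected clock are assumptions, and the code does not touch a real serial port or reproduce any particular cluster package.

```python
# Illustrative sketch of heartbeat-based failure detection.
# Names and the timeout value are hypothetical.
import time


class HeartbeatMonitor:
    """Tracks when the peer node was last heard from."""

    def __init__(self, timeout_s=3.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock            # injectable so the logic can be tested
        self.last_seen = clock()      # treat startup as the first heartbeat

    def beat(self):
        """Call whenever a heartbeat message arrives from the peer."""
        self.last_seen = self.clock()

    def peer_alive(self):
        """The peer is presumed failed once it has been silent too long."""
        return (self.clock() - self.last_seen) <= self.timeout_s
```

A surviving node would poll `peer_alive()` and begin taking over the partner's services once it returns `False`. Choosing the timeout is a trade-off: a short value speeds up failover but risks a false alarm if a heartbeat is merely delayed.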

Clustering using Linux

Linux is a Unix-type operating system offering high stability (like other Unix systems) while remaining somewhat user-unfriendly. However, because of its freely available source code, Linux has enjoyed rapid adoption and the highest growth rate among currently available OSs within the server market.

In 1998, Linux-based systems comprised an estimated 17.2% of the total server market (based on a total of 4.3 million new server deliveries), compared with 37% for Windows NT (Reference 1). Within the Web-server market, however, Linux has grown to become the leading operating system. Table 2 summarizes the percentages of Web servers based on the indicated OSs, as of June 2000 (Reference 2).

The popularity and growth of Linux-based servers for the Web and other purposes are expected to continue; market researchers have projected an average annual growth rate of 25% through 2003. A key to maintaining this rapid growth is the increasing level of support for high-availability Linux applications, which stems from the cooperative efforts of individual commercial vendors and developers. In addition, the Linux open-source-code policy remains a strategic advantage.

Although the release of Linux kernel version 2.0 provided some of the features necessary for software support of high-availability systems, a number of developers are working to extend its functions. The Web site for the High-Availability Linux Project (http://linux-ha.com) is an excellent source of updates on the development of high-availability Linux software, with links to individual efforts and options. In addition, several vendors have addressed the need for high-availability software solutions. A small sample of current packages includes:

  • Apptime Technologies' "light" version of its cluster software Watchdog for Linux, which targets Web and mail servers and similar applications;
  • Technauts' eServers, which offer a quoted availability of 99.99% (in the case of failure, a second system takes over within a few seconds); and
  • TurboLinux's TurboCluster Servers, which provide automatic load distribution using multiple computers, combining the Linux OS with high-availability software and configuration tools.

These packages represent only a small sample of the available commercial offerings. Interested users should consult the extensive Linux information resources available on the Internet.

The Linux user community has expressed one significant concern: the possibility of a fragmented "Linux Standard" resulting from increased commercial involvement. Even so, it is clear that commercial suppliers have recognized the need for Linux-based options for high-availability applications, and the market should help "standardize" the most useful options.

Combining Linux with CompactPCI

The CompactPCI specification has gained wide acceptance within the world of industrial computing. CompactPCI was developed as an adaptation of the PCI specification for industrial and embedded applications, which call for more robust mechanical interconnections than those used in office-based systems. CompactPCI is an open specification, supported and controlled by the PICMG (PCI Industrial Computer Manufacturers Group).

Many features of the CompactPCI specification are attractive to systems engineers designing high-availability servers. CompactPCI was designed so that computers could function reliably in harsh environments (typical of industrial applications). These systems must tolerate vibration, dust, temperature extremes, and other conditions that would cripple normal PCs. Also, the MTBF (mean time between failures) is significantly longer for CompactPCI systems than for normal PCs, and the MTTR (mean time to repair) is significantly shorter.
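Both figures feed directly into availability through the standard steady-state relation, availability = MTBF / (MTBF + MTTR). The sketch below applies it to two invented configurations; the hour values are assumptions chosen only to show the effect of a shorter repair time and are not measurements of any CompactPCI or PC system.

```python
# Illustrative sketch: steady-state availability from MTBF and MTTR.
# The example figures are hypothetical.
def steady_state_availability(mtbf_hours, mttr_hours):
    """Fraction of time a repairable system is expected to be operational."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    slow_repair = steady_state_availability(mtbf_hours=10_000, mttr_hours=4.0)
    fast_repair = steady_state_availability(mtbf_hours=10_000, mttr_hours=0.5)
    print(f"4-hour repair:    {slow_repair:.5%}")
    print(f"30-minute repair: {fast_repair:.5%}")
```

With the failure rate held constant, cutting the repair time alone raises availability, which is why quickly replaceable boards matter as much as rugged ones.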

System engineers also appreciate that CompactPCI uses the Eurocard industry-standard 3U (100×160-mm) and 6U (233.35×160-mm) form factors (Reference 3). You can design systems that fit into tight spaces using a range of standard and custom rack solutions. Further, routing I/O through a CompactPCI backplane reduces the chance of interconnection disruption, while enabling rapid adjustments, upgrades, and replacements from the rear.

Hot-swap technology—the ability to replace or supplement boards while the system is running—is also an important feature. It requires support from the system hardware, operating system, and application software.

These attributes enable CompactPCI to support the essential features of high-availability and small-form-factor servers. As a result, the platform is ideal for meeting the challenges of the explosive growth in Internet-server requirements. The use of CompactPCI for Internet servers introduces a third crucial ingredient for such a rapidly moving field: standardization. Anticipated upgrades and replacements are much more straightforward when you can consistently configure them to the CompactPCI specification.

You can combine high availability and clustering support for Linux with CompactPCI to create a virtual system based on two or more individual CPU units installed in the same CompactPCI backplane. During normal operation, Linux clustering permits load sharing among the individual CPUs. In the event of a CPU failure, the software can distribute functions over the remaining CPUs. Therefore, it provides both redundancy and load balancing in a manner that is imperceptible to system users.
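The combination of load sharing and failover can be sketched as a dispatcher that rotates requests across the CPUs currently marked healthy. This is a simplified illustration under assumed names (`Cluster`, `cpu0`, `cpu1`) using plain round-robin; it does not represent the scheduling method of any specific Linux clustering package.

```python
# Illustrative sketch: round-robin load sharing that skips failed nodes.
# Class and node names are hypothetical.
class Cluster:
    def __init__(self, nodes):
        self.nodes = list(nodes)          # fixed order of configured nodes
        self.healthy = set(self.nodes)    # nodes currently able to serve
        self._counter = 0

    def mark_failed(self, node):
        self.healthy.discard(node)

    def mark_recovered(self, node):
        if node in self.nodes:
            self.healthy.add(node)

    def dispatch(self):
        """Return the node that should handle the next request."""
        candidates = [n for n in self.nodes if n in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy nodes left in the cluster")
        node = candidates[self._counter % len(candidates)]
        self._counter += 1
        return node


if __name__ == "__main__":
    cluster = Cluster(["cpu0", "cpu1"])
    print([cluster.dispatch() for _ in range(4)])  # load alternates
    cluster.mark_failed("cpu0")
    print([cluster.dispatch() for _ in range(2)])  # survivor takes all requests
```

Because callers only ever ask the dispatcher for a node, a failure changes which CPU answers but not how users reach the service, at the cost of reduced capacity until the failed unit returns.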

Current Linux clustering-software packages provide a clear method of establishing redundant systems that easily accommodate subsequent performance upgrades. Combining this approach with the CompactPCI platform enables user-friendly maintenance and expansion, while maintaining high availability.

Figure 2 illustrates a Web-server configuration from PEP Modular Computers. The Linux cluster support is based on TurboLinux. The server comprises two separate network computers, which together form a common URL domain. Although each CPU connects via a Fast Ethernet link, both CPUs reside in one CompactPCI system. The two computers share user access to the Web-server Internet pages, depending on loading. In the event of computer failure (from either a hardware or software fault), the second system transparently assumes the task of the failed computer within seconds.

In principle, you could easily expand this approach to three or more computer units installed in the same rack. You could also apply the same general framework to performance- and processor-intensive applications in industrial computing, process control, and automation.

Linux clustering on a CompactPCI platform can create an Internet server that is compact, reliable, and amenable to enhancements and performance upgrades. The intensive ongoing work on clustering and high-availability software under Linux ensures that companies will continue to develop new and even more user-friendly solutions and offer them through both commercial and noncommercial channels.


For more information...
For information on subjects discussed in this article, use EDN's information-request service. When you contact any of the following manufacturers directly, please let them know you read about their products in EDN.
Apptime Technologies Inc
1-781-245-3366
www.apptime.com
Enter No. 301
International Data Corp
1-508-872-8200
www.idc.com
Enter No. 302
PCI Industrial Computer Manufacturers Group
1-781-246-9318
www.picmg.com
Enter No. 304
PEP Modular Computers Inc
1-412-921-3322
www.pep.com
Enter No. 303
Technauts Inc
1-919-462-1713
www.technauts.com
Enter No. 305
TurboLinux Inc
1-650-228-5000
www.turbolinux.com
Enter No. 306

Author info

Harald Mueller is a project manager in the communications division of PEP Modular Computers. He holds a Diplom-Ingenieur in electrical engineering from Fachhochschule Ravensburg-Weingarten (Germany).

 

Russell Scott is a technical support manager at PEP Modular Computers, where he has worked for eight years. He has a BSEE from Virginia Polytechnic Institute and State University (Blacksburg, VA).

REFERENCES

1. International Data Corp, www.idc.com.

2. The Netcraft Web Server Survey, www.netcraft.com/survey.

3. CompactPCI Specification, 2.0 R3.0, www.picmg.com.



