Feature
Sense of self: enabling systems to monitor—and control—their environment
There's nothing quite like having to explain to your boss that your entire network is down because a $12 fan died. Proper management of the enclosure environment lets you pre-empt such disastrous and embarrassing failures.
By Nicholas Cravotta, Technical Editor -- EDN, 10/25/2001

The telecom industry already demands high availability, and if data communications wants to compete for voice and video services, it needs to aim for a similar standard. Also known as five nines, high availability refers to systems that are available 99.999% of the time, which allows about five minutes of downtime per year, including time for scheduled maintenance. The most common way to achieve this goal is through redundancy of hardware components. To achieve efficient high availability, however, you need to monitor a system to identify potential failures before they happen and actively prevent them.
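The five-minute figure follows directly from the availability percentage. As a quick check, the sketch below computes the yearly downtime budget for a given availability; the function name and the use of a 365-day year are illustrative assumptions, not part of any standard.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    levels = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
    for label, availability in levels:
        minutes = downtime_minutes_per_year(availability)
        print(f"{label}: {minutes:.2f} minutes of downtime per year")
```

Five nines works out to roughly 5.26 minutes per year, and that budget has to cover scheduled maintenance as well as failures.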
System monitoring occurs on many levels. For example, a host processor can watch the statistical performance of nodes to determine whether a node has experienced more errors or has dropped more packets than it should. The problem is that the place where a failure shows up is not always where it originates. For example, a fan that fails on one line card can result in the overheating and failure of adjacent line cards. You need enclosure, or environmental, management at the physical level to identify potential problems before they cause failures and, just as important, to help pinpoint their location.
Fans, sensors, and monitors
Given the higher power dissipation of increasingly compact equipment, enclosure management must occur within the enclosure or chassis. The tools for enclosure management include various sensors that detect variations in voltages, temperature, and humidity; fans that control heat dissipation; and monitors that collect information from sensors and decide either to act on it or to pass it up to a host server with more intelligence.
Measuring temperature requires a sensor to determine the temperature and convert it to a value an enclosure-management system can understand. Analog sensors create an output voltage that is proportional to the sensed temperature. Note that the voltage is not necessarily a linear representation of temperature; twice the voltage is not twice the temperature. The accuracy of sensors varies, with best tolerances of 0.33 to 3°C, depending on the vendor. For a microprocessor to use the temperature information, the voltage must pass through an ADC and then be transformed using an equation that the sensor vendor provides.
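The analog path described above (sensor voltage, then ADC, then vendor equation) can be sketched in a few lines. Everything specific here is an assumption for illustration: the 10-bit ADC, the 2.5V reference, and the polynomial coefficients are hypothetical and do not describe any real sensor. A real design would take the transfer equation from the sensor vendor's data sheet.

```python
def adc_to_volts(counts: int, vref: float = 2.5, bits: int = 10) -> float:
    """Convert a raw ADC reading to volts (hypothetical 10-bit ADC, 2.5V reference)."""
    full_scale = 1 << bits
    if not 0 <= counts < full_scale:
        raise ValueError("ADC reading out of range")
    return counts * vref / full_scale


def volts_to_celsius(volts: float, coeffs=(-50.0, 100.0, -4.0)) -> float:
    """Apply a vendor-style transfer polynomial c0 + c1*V + c2*V^2.

    The default coefficients are made up to show a nonlinear response.
    """
    result = 0.0
    for coeff in reversed(coeffs):  # Horner's rule, highest-order term first
        result = result * volts + coeff
    return result


if __name__ == "__main__":
    for counts in (256, 512):
        volts = adc_to_volts(counts)
        print(f"{counts} counts -> {volts:.3f} V -> {volts_to_celsius(volts):.2f} C")
```

With these made-up coefficients, doubling the voltage does not double the temperature, which is the nonlinearity the text warns about.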
Digital sensors, such as National Semiconductor's 75-cent (1000) LM75 digital temperature sensor, tend to be more intelligent than analog sensors. A digital sensor integrates an ADC and voltage transform, allowing the sensor to digitally output the temperature over a standard bus, such as I2C. An I2C bus also allows a management processor to configure, update, control, and poll a sensor. Many digital sensors also support high and low thresholds, and the sensor sends an alert over the bus if the temperature exceeds either threshold. Unfortunately, unless the sensor can be a master on the bus (most are slave-only), it passes information up the system hierarchy only when the management processor polls it. Thus, the latency of an alert can be as high as the time between polls. To reduce this latency, many sensors have two additional outputs, active-high threshold and active-low threshold, that trigger if the temperature exceeds the appropriate threshold. You can run either output to an external interrupt on a management processor, signaling that it should immediately poll the sensor for the alert.
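On the management-processor side, polling a digital sensor comes down to reading a register and decoding it. The sketch below assumes the LM75 temperature-register layout as commonly documented (a 9-bit two's-complement value in the upper bits of a 16-bit word read most-significant byte first, 0.5°C per step); confirm this against the data sheet for the exact part you use. The bus read itself is omitted, and the function names and threshold values are hypothetical.

```python
from typing import Optional


def lm75_raw_to_celsius(raw16: int) -> float:
    """Decode a 16-bit LM75-style temperature word into degrees Celsius."""
    if not 0 <= raw16 <= 0xFFFF:
        raise ValueError("expected a 16-bit register value")
    value = raw16 >> 7      # keep the upper 9 bits
    if value & 0x100:       # sign bit set: negative temperature
        value -= 0x200      # sign-extend the 9-bit two's-complement value
    return value * 0.5      # 0.5 C per least-significant bit


def threshold_alert(temp_c: float, low_c: float, high_c: float) -> Optional[str]:
    """Return 'high' or 'low' when a threshold is exceeded, else None."""
    if temp_c > high_c:
        return "high"
    if temp_c < low_c:
        return "low"
    return None


if __name__ == "__main__":
    reading = lm75_raw_to_celsius(0x1900)  # sample word, not read from hardware
    print(reading, threshold_alert(reading, low_c=5.0, high_c=70.0))
```

In a polled design this check runs once per poll, so the worst-case alert latency equals the poll interval. Wiring the sensor's threshold output to an interrupt lets the processor run the same read-and-decode step immediately.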
At some point, however, you need to aggregate sensor data so that you can act on it. A host server can monitor sensors, but enclosure management is a monotonous, low-frequency task. You can give this task to a main server processor, but consider that the server processor will be unable to do other tasks that generate revenue. Offloading the task to an 8- or a 16-bit processor or dedicated monitor frees up the server processor. Many designs use an 8051 processor, which can sufficiently handle the processing of sensor data. However, such designs may require a handful of additional chips to provide additional I/O and buses to support enough sensors, as well as driver software (see sidebar "Just can't get enough I/O").
Monitor, or controller, sensors vacuum up many of these functions into one chip and provide software libraries for functions such as interpreting sensor input, driving fans, and passing alarms to a host server. They typically do not have integrated sensors but can aggregate and manage other sensors through either general-purpose I/Os or an I2C bus. Most sensors can directly drive fans using an on-chip PWM. Some sensors can drive LEDs (for example, to show which card you should remove during maintenance), buzzers (for direct alert), or even voice; the Winbond W83791D costs $2 to $3 (1000) and can tell an operator, "I'm too hot" or "Fan is wearing out." Monitor prices range from a few dollars for a monitor that supports a couple of sensors and fans to more than $50 for a monitor that provides a direct connection to an in-band bus, such as Fibre Channel; several RS-232 ports; multiple I2C buses; more than 50 I/Os; and software.
To make a monitor useful, you must configure it to the specifications of the system it monitors. For example, a monitor can determine that the temperature is rising past a certain threshold and increase the speed of the fan without ever contacting the host server. Two kinds of software enable this process: drivers for the sensors, fans, and other devices and software to define the personality of the monitor. Some monitor vendors supply libraries with drivers to support a variety of sensors, LEDs, buzzers, and other devices. Programming the personality of the monitor can range from writing 8051 assembly or a high-level language, such as C/C++, to using a GUI-based configuration application that allows you to set thresholds and define rules. Personalities can be a few kilobytes for simple monitoring and into the megabytes for added features, such as Web servers or direct e-mail support. You can also provide code to collect status information and pass it up to a host server for further evaluation. For example, the monitor may adjust the fan, but it doesn't know how often it adjusts the fan or that adjusting the fan is often an indicator that the fan may be ready to fail.
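A personality is, at its core, a set of thresholds and rules plus some bookkeeping. The following is a minimal sketch of such a rule set in Python, chosen here for readability rather than because a monitor would run it. The class name, temperature thresholds, and duty-cycle limits are all hypothetical. It ramps the fan's PWM duty cycle between two thresholds and counts adjustments so that a host server can later judge whether the fan is working unusually hard.

```python
from dataclasses import dataclass


@dataclass
class FanPersonality:
    """Hypothetical rule set: map temperature to PWM duty and count changes."""
    warn_c: float = 45.0      # below this, run the fan at min_duty
    critical_c: float = 60.0  # at or above this, run at max_duty and alert
    min_duty: int = 30        # percent
    max_duty: int = 100       # percent
    duty: int = 30            # current duty cycle, percent
    adjustments: int = 0      # status the host can collect for trend analysis

    def update(self, temp_c: float) -> bool:
        """Apply the rules to one reading; return True if the host needs an alert."""
        if temp_c <= self.warn_c:
            target = self.min_duty
        elif temp_c >= self.critical_c:
            target = self.max_duty
        else:
            span = (temp_c - self.warn_c) / (self.critical_c - self.warn_c)
            target = round(self.min_duty + span * (self.max_duty - self.min_duty))
        if target != self.duty:
            self.duty = target
            self.adjustments += 1
        return temp_c >= self.critical_c


if __name__ == "__main__":
    fan = FanPersonality()
    for reading in (40.0, 52.5, 65.0):
        alert = fan.update(reading)
        print(reading, fan.duty, fan.adjustments, alert)
```

The monitor handles the first two readings by itself. Only the third, which reaches the critical threshold, is reported upward, while the `adjustments` counter is the kind of status a host can evaluate over time.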
The key to designing an effective and efficient enclosure-management architecture is deciding where to locate the intelligence. For centralized designs, monitors poll dumb sensors for status information and decide how to act on it. Smart sensors enable a passive approach to monitoring that takes advantage of the fact that as long as the condition you are measuring stays within defined thresholds, you don't need to take action or bother a host server with such details. It is only when conditions exceed a sensor's thresholds that the sensor will send a warning up to the server (see sidebar "A sensor by any other name").
You can dedicate a line card to the monitoring task or, somewhat equivalently, plug an enclosure-management module into a line card. Depending on how many LEDs, alarms, and other components you want to support, this approach can take up an entire card. Unfortunately, this option uses up a valuable slot in the enclosure. Alternatively, you can locate the enclosure-management intelligence on the active backplane. However, if the enclosure-management subsystem fails, you have to pull everything out of the backplane before you replace it. This process can be costly in downtime, which is why passive backplanes—basically, a bunch of connectors and no active or intelligent components to break down—tend to be more common.
In band, out of band
With intelligence spread throughout the enclosure and out to a host server, you need a reliable means for passing enclosure-management information around so that you can effectively use it. One school of thought suggests using any existing infrastructure as a datapath (in band). For example, the enclosure-management system in a Fibre Channel rack can pass information to a host server via the Fibre Channel connection. Such a scheme has the advantage of keeping down costs. Unfortunately, if the Fibre Channel connection fails, the link with the host server also fails. This situation results in the loss of access to information that can pinpoint a failure. Instead, a technician knows only that something on the rack has failed. Additionally, monitoring data that you need to pass in a crisis can impact data throughput.
The disadvantage of passing data out of band (that is, on a dedicated bus) is that it can be expensive to run another bus between the enclosure that you're monitoring and a host server, especially if the two are physically far away from each other. However, if you are designing a high-availability system, then you need to provide a redundant datapath for reliability anyway. In such cases, having both an in-band and an out-of-band datapath covers most of your bases. Note that unless you're also in control of the software running on the host server, you need to send data in a standard format, such as ESI (enclosure-services interface), SAF-TE (SCSI-access fault-tolerant enclosure, usually over SCSI), SES (usually over FC), or IPMI (Intelligent Platform-Management Interface) (see sidebar "Enclosure initiatives").
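When both datapaths exist, the sending side needs a rule for choosing between them. One simple possibility, sketched below, is to try the in-band path first and fall back to the out-of-band path when the link is down. The transport functions and the alert payload are placeholders invented for this example; a real system would format the message according to one of the standards named above.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Transport = Tuple[str, Callable[[bytes], None]]


def send_alert(message: bytes, transports: Iterable[Transport]) -> Optional[str]:
    """Try each datapath in order; return the name of the one that worked."""
    for name, send in transports:
        try:
            send(message)
            return name
        except ConnectionError:
            continue  # this path is down, so try the next one
    return None       # no path reached the host server


def broken_in_band(message: bytes) -> None:
    """Stand-in for an in-band link that has failed."""
    raise ConnectionError("Fibre Channel link down")


delivered: List[bytes] = []  # stand-in for a dedicated out-of-band bus

used = send_alert(
    b"FAN2 STALLED",  # placeholder payload, not a standard message format
    [("in-band", broken_in_band), ("out-of-band", delivered.append)],
)
print(used, delivered)
```

Ordering the list with the in-band path first keeps routine traffic off the dedicated bus, while the fallback preserves the information a technician needs to pinpoint a failure.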
The intelligence hierarchy
If a sensor cannot handle an event, it passes an alert to a monitor. If a monitor cannot handle the event (speeding up the fan does not sufficiently reduce the temperature), it passes an alert up to the next layer of intelligence. In some cases, this next level is a human being who then decides how to take action. For larger systems, however, it makes sense to add another layer of management through middleware or support software. For example, a monitor might send an alert whenever someone opens the enclosure. Using middleware, the host server can determine whether a service technician has opened the correct box during servicing and flash LEDs if the technician has made a mistake and is about to pull out the wrong card.
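This escalation path can be modeled as a chain in which each level either absorbs an event or hands it upward. The sketch below is a simplification under stated assumptions: the event fields, the handler names, and the rule that a monitor gives up once the fan is already at full speed are all hypothetical stand-ins for the behavior described above.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, float]
Level = Tuple[str, Callable[[Event], bool]]


def sensor_level(event: Event) -> bool:
    """A sensor has nothing to report while the reading is within its threshold."""
    return event["temp_c"] < event["threshold_c"]


def monitor_level(event: Event) -> bool:
    """A monitor can cope as long as the fan still has headroom to speed up."""
    return event["fan_duty"] < 100


def escalate(event: Event, levels: List[Level], last_resort: str = "operator") -> str:
    """Return the name of the first level able to handle the event."""
    for name, can_handle in levels:
        if can_handle(event):
            return name
    return last_resort  # nobody below could cope, so a human decides


HIERARCHY: List[Level] = [("sensor", sensor_level), ("monitor", monitor_level)]

if __name__ == "__main__":
    hot = {"temp_c": 55.0, "threshold_c": 50.0, "fan_duty": 100.0}
    print(escalate(hot, HIERARCHY))
```

Adding a middleware layer amounts to appending another entry to the list ahead of the human operator.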
Middleware also handles mapping out a system by taking care of issues such as node discovery or hot insertion. If you add a drive to a system, you have to monitor that drive as well. Without the appropriate software, you must remember to correctly configure a monitor to recognize the new drive. However, with the appropriate software, initializing monitoring of the new drive may be automated. This approach also enables a network administrator to view device environmental status with the same logical map he or she used to monitor device performance.
Given the simplicity of sensors, you have little room to differentiate a product at that level. However, significant room exists to differentiate how to collect, process, and act on that data. Enclosure-management support software and middleware play an important role in managing the monitoring system and aggregating multiple systems under one management umbrella. For example, statistics collected over time can reveal patterns of failure that you can address as they start to appear. Middleware also decides how to handle alerts and how, if necessary, to contact the right person, such as a network administrator, in an appropriate way. Additionally, middleware can hand over data to the application layer for display formatting suitable for you to access.
With enclosure-management support software, you must decide whether you want off-the-shelf support or a proprietary package. Proprietary software requires you to invest in creating and then supporting the middleware. One advantage is that you then have complete control over what you monitor and how you process that information; for example, you don't have to figure out elaborate ways of working around a protocol that doesn't allow you to do exactly what you want to do. You can also provide important features, such as automatic configuration, that off-the-shelf middleware doesn't support. If your software supports only your devices, this may be an advantage in that it forces your customers to buy only your products if they want comprehensive enclosure management. However, supporting your own devices may also lock you out of certain markets. One compromise is to support other vendors' equipment, but you should also be able to provide extra features when you work with your equipment. If you don't want to deal with creating middleware, then you need to design your overall enclosure-management scheme to support protocols and features that off-the-shelf middleware supports.
Who monitors the monitors?
For maximum reliability, a system needs a redundant enclosure-management subsystem. Such redundancy can extend to a level as low as primary and backup sensors. Alternatives to supporting primary and backup sensors include placing sensors so that they overlap in coverage or implementing a method for testing individual sensors on an automated and regular basis. At the highest level, redundancy demands independent monitoring loops, which include distinct datapath buses to a host server.
Most important, redundant enclosure-management subsystems monitor each other. Both subsystems are connected, usually over an I2C bus, and monitor each other through a shared "heartbeat." A heartbeat can be as simple as a pulse or a polled request that each subsystem answers to let the other subsystem know that it is still active. If the heartbeat fails, the surviving enclosure-management subsystem takes over monitoring the entire system and runs diagnostics on the failed loop to locate the problem.
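The heartbeat check itself is little more than a timestamp and a timeout. The sketch below shows one way a subsystem might track its peer; the class name and the two-second timeout are assumptions, and the clock is injectable so that the logic can be exercised without waiting in real time.

```python
import time
from typing import Callable


class HeartbeatWatch:
    """Track the last heartbeat seen from the peer subsystem."""

    def __init__(self, timeout_s: float, now: Callable[[], float] = time.monotonic):
        self._timeout_s = timeout_s
        self._now = now
        self._last_beat = now()  # treat start-up as the first heartbeat

    def beat(self) -> None:
        """Record a heartbeat (a pulse or a reply to a polled request)."""
        self._last_beat = self._now()

    def peer_alive(self) -> bool:
        """Return False once the peer has been silent longer than the timeout."""
        return (self._now() - self._last_beat) <= self._timeout_s


if __name__ == "__main__":
    watch = HeartbeatWatch(timeout_s=2.0)
    watch.beat()
    print("peer alive:", watch.peer_alive())
```

When `peer_alive()` returns False, the subsystem that is still running would take over monitoring and start diagnostics on the silent loop.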
To be truly robust, the enclosure-management subsystem must be able to survive a complete system failure. For example, the subsystem must have its own power supply; an independent means for sending alarms outside the system, such as through a direct phone connection because the system cannot transfer alerts; and a means for describing the problem and its location. For example, if the host server is down, middleware is unavailable to process data for human access; a Web server can be useful for such a contingency. Web servers are also useful because they allow a technician to directly plug into the enclosure and retrieve data rather than having to trek back and forth to another computer or interpret a confusing bank of flashing LEDs (see sidebar "Considerations for building enclosures").
Enclosure management is a tool that prevents costly or catastrophic failure. Some of the prevention measures may seem extreme, depending on the application; however, the consequences of downtime may make them absolutely necessary.
For more information...
When you contact any of the following manufacturers directly, please let them know you read about their products in EDN.
Agilent, 1-650-752-5000, www.agilent.com
Analog Devices, 1-781-329-4700, www.analogdevices.com
Mylex (a business unit of IBM), 1-510-796-6100, www.mylex.com
National Semiconductor, www.nsc.com
Pentair, 1-401-732-3770, www.pentair-ep.com
Philips, 1-800-234-7381, www.semiconductors.philips.com
QLogic, 1-949-389-6000, www.qlogic.com
Summit Microelectronics, 1-408-378-6461, www.summitmicro.com
Triscend, 1-650-968-8668, www.triscend.com
Vitesse, 1-805-388-3700, www.vitesse.com
Winbond Electronics Corp America, 1-408-943-6666, www.winbond-usa.com
IPMI and ICMB resources, http://developer.intel.com/design/servers/ipmi/
Author Information
You can reach Technical Editor Nicholas Cravotta at 1-510-558-8906, fax 1-510-558-8914, e-mail ednnick@pacbell.net.
Acknowledgment
Special thanks to Mardy Marshall, director of Core Systems at WaterCove Networks (www.watercove.com), for his contributions to this article.