Feature

Sense of self: enabling systems to monitor—and control—their environment

There's nothing quite like having to explain to your boss that your entire network is down because a $12 fan died. Proper management of the enclosure environment lets you pre-empt such disastrous and embarrassing failures.

By Nicholas Cravotta, Technical Editor -- EDN, 10/25/2001

AT A GLANCE
  • Enclosure management is a tool that prevents costly or catastrophic failure.
  • Enclosure management allows faster identification of failures than traditional troubleshooting approaches.
  • Monitors aggregate sensor data and provide a base level of intelligence for acting on that information.
  • Middleware enables collection and analysis of environmental data on a systemwide level.
Sidebars:
Considerations for building enclosures
A sensor by any other name
Enclosure initiatives
Acronyms
Just can't get enough I/O

The telecom industry already demands high availability, and if data communications wants to compete for voice and video services, it needs to aim for a similar standard. Also known as five nines, high availability refers to systems that are available 99.999% of the time, which allows about five minutes of downtime per year, including time for scheduled maintenance. The most common way to achieve this goal is through redundancy of hardware components. To achieve efficient high availability, however, you need to monitor a system to identify potential failures before they happen and actively prevent them.
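The five-minute figure follows from simple arithmetic. The short sketch below works it out for a few availability levels; the function name and the use of an average 365.25-day year are choices made for this illustration.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, leap years included


def allowed_downtime_minutes(availability: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability in 0..1."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, level in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    print(f"{label}: {allowed_downtime_minutes(level):.2f} minutes per year")
```

Five nines leaves a budget of a little more than five minutes per year, so a single unplanned reboot can consume the entire allowance.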

System monitoring occurs on many levels. For example, a host processor can watch the statistical performance of nodes to determine whether a node has experienced more errors or has dropped more packets than it should. The problem is that the point of failure is not always where the actual failure is. For example, a fan that fails on one line card can result in the overheating and failure of adjacent line cards. You need enclosure, or environmental, management at the physical level to identify potential problems before they cause failures and, as important, to help pinpoint their location.

Fans, sensors, and monitors

Given the higher power dissipation of increasingly more compact equipment, enclosure management must occur within the enclosure or chassis. The tools for enclosure management include various sensors that detect variations in voltages, temperature, and humidity; fans that control heat dissipation; and monitors that collect information from sensors and decide either to act on it or to pass it up to a host server with more intelligence.

Measuring temperature requires a sensor to determine the temperature and convert it to a value an enclosure-management system can understand. Analog sensors create an output voltage that is proportional to the sensed temperature. Note that the voltage is not necessarily a linear representation of temperature; twice the voltage is not twice the temperature. The accuracy of sensors varies, with best tolerances of 0.33 to 3°C, depending on the vendor. For a microprocessor to use the temperature information, the voltage must pass through an ADC and then be transformed using an equation that the sensor vendor provides.
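As a rough sketch of that conversion chain, the code below turns a raw ADC reading into volts and then applies a transfer equation. The 10-bit ADC, the 3.3-V reference, and the polynomial coefficients are all placeholders invented for this example; a real design takes them from the ADC and sensor data sheets.

```python
def adc_to_volts(counts: int, vref: float = 3.3, bits: int = 10) -> float:
    """Convert a raw ADC reading to volts (hypothetical 10-bit ADC, 3.3-V reference)."""
    full_scale = (1 << bits) - 1
    if not 0 <= counts <= full_scale:
        raise ValueError("ADC reading out of range")
    return counts * vref / full_scale


# Placeholder coefficients standing in for a vendor transfer equation of the
# form T(V) = C0 + C1*V + C2*V^2. They do not describe any real sensor.
COEFFS = (-50.0, 100.0, -2.0)


def volts_to_celsius(volts: float, coeffs=COEFFS) -> float:
    """Apply the (assumed) vendor polynomial to get degrees Celsius."""
    c0, c1, c2 = coeffs
    return c0 + c1 * volts + c2 * volts * volts


def read_temperature(counts: int) -> float:
    """Full chain: ADC counts to volts to degrees Celsius."""
    return volts_to_celsius(adc_to_volts(counts))
```

With these made-up coefficients, 1V maps to 48°C and 2V maps to 142°C, which illustrates the point that twice the voltage is not twice the temperature.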

Digital sensors, such as National Semiconductor's 75-cent (1000) LM75 digital temperature sensor, tend to be more intelligent than analog sensors. A digital sensor integrates an ADC and voltage transform, allowing the sensor to digitally output the temperature over a standard bus, such as I2C. An I2C bus also allows a management processor to configure, update, control, and poll a sensor. Many digital sensors also support high and low thresholds, and the sensor sends an alert over the bus if the temperature exceeds either threshold. Unfortunately, unless the sensor can be a master on the bus (most are slave-only), it passes information up the system hierarchy only when the management processor polls it. Thus, the latency of an alert can be as high as the time between polls. To reduce this latency, many sensors have two additional outputs, one for the high threshold and one for the low threshold, that trigger if the temperature crosses the corresponding threshold. You can run these outputs to an external interrupt on a management processor, signaling that it should immediately poll the sensor for the alert.
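To make the polling step concrete, here is a minimal sketch of what a management processor might do with the two bytes it reads from an LM75-style temperature register. It assumes the commonly documented LM75 layout (a 9-bit two's-complement value in the upper bits of a 16-bit word, 0.5°C per step); check the data sheet of the part you actually use. The bus read itself is not shown, and the function names are invented for this example.

```python
def decode_lm75(msb: int, lsb: int) -> float:
    """Decode two register bytes into degrees Celsius (assumed LM75-style format)."""
    raw = ((msb << 8) | lsb) >> 7   # keep the upper 9 bits
    if raw & 0x100:                 # sign bit set: negative temperature
        raw -= 0x200
    return raw * 0.5


def check_thresholds(temp_c: float, low_c: float, high_c: float):
    """Return 'high', 'low', or None, mirroring a sensor's threshold comparison."""
    if temp_c > high_c:
        return "high"
    if temp_c < low_c:
        return "low"
    return None
```

A polling loop would call `decode_lm75` once per interval, which is why the worst-case alert latency equals the polling interval unless a threshold output is wired to an interrupt.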

At some point, however, you need to aggregate sensor data so that you can act on it. A host server can monitor sensors, but enclosure management is a monotonous, low-frequency task. You can give this task to a main server processor, but consider that the server processor will be unable to do other tasks that generate revenue. Offloading the task to an 8- or a 16-bit processor or dedicated monitor frees up the server processor. Many designs use an 8051 processor, which can sufficiently handle the processing of sensor data. However, such designs may require a handful of additional chips to provide additional I/O and buses to support enough sensors, as well as driver software (see sidebar "Just can't get enough I/O").

Monitors, or controllers, vacuum up many of these functions into one chip and provide software libraries for functions such as interpreting sensor input, driving fans, and passing alarms to a host server. They typically do not have integrated sensors but can aggregate and manage other sensors through either general-purpose I/Os or an I2C bus. Most monitors can directly drive fans using an on-chip PWM. Some monitors can drive LEDs (for example, to show which card you should remove during maintenance), buzzers (for direct alert), or even voice; the Winbond W83791D costs $2 to $3 (1000) and can tell an operator, "I'm too hot" or "Fan is wearing out." Monitor prices range from a few dollars for a monitor that supports a couple of sensors and fans to more than $50 for a monitor that provides a direct connection to an in-band bus, such as Fibre Channel; several RS-232 ports; multiple I2C buses; more than 50 I/Os; and software.

To make a monitor useful, you must configure it to the specifications of the system it monitors. For example, a monitor can determine that the temperature is rising past a certain threshold and increase the speed of the fan without ever contacting the host server. Two kinds of software enable this process: drivers for the sensors, fans, and other devices and software to define the personality of the monitor. Some monitor vendors supply libraries with drivers to support a variety of sensors, LEDs, buzzers, and other devices. Programming the personality of the monitor can range from writing 8051 assembly or a high-level language, such as C/C++, to using a GUI-based configuration application that allows you to set thresholds and define rules. Personalities can be a few kilobytes for simple monitoring and run into the megabytes for added features, such as Web servers or direct e-mail support. You can also provide code to collect status information and pass it up to a host server for further evaluation. For example, the monitor may adjust the fan, but it doesn't know how often it adjusts the fan or that frequent adjustment can indicate that the fan may be ready to fail.
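A personality can be pictured as a small rule table plus some bookkeeping. The sketch below is one hypothetical way to express it; the thresholds, duty cycles, and class name are invented for illustration and do not reflect any vendor's configuration format.

```python
# Hypothetical personality: (threshold in deg C, fan duty cycle in percent).
FAN_RULES = [(30.0, 40), (40.0, 60), (50.0, 80), (60.0, 100)]
IDLE_DUTY = 25       # duty cycle below the first threshold
CRITICAL_C = 70.0    # at or above this, the host server must be told


class MonitorPersonality:
    def __init__(self, rules=FAN_RULES, idle_duty=IDLE_DUTY, critical_c=CRITICAL_C):
        self.rules = sorted(rules)
        self.idle_duty = idle_duty
        self.critical_c = critical_c
        self.duty = idle_duty
        self.adjustments = 0  # status a host server can collect and trend

    def update(self, temp_c: float) -> bool:
        """Apply the rules to one reading; return True if the host must be alerted."""
        duty = self.idle_duty
        for threshold, rule_duty in self.rules:
            if temp_c >= threshold:
                duty = rule_duty
        if duty != self.duty:
            self.duty = duty
            self.adjustments += 1
        return temp_c >= self.critical_c
```

The monitor handles routine changes locally, and the `adjustments` counter is the kind of status information it could pass up so that a host server can spot a fan that needs attention unusually often.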

The key to designing an effective and efficient enclosure-management architecture is deciding where to locate the intelligence. For centralized designs, monitors poll dumb sensors for status information and decide how to act on it. Smart sensors enable a passive approach to monitoring that takes advantage of the fact that as long as the condition you are measuring stays within defined thresholds, you don't need to take action or bother a host server with such details. It is only when conditions exceed a sensor's thresholds that the sensor will send a warning up to the server (see sidebar "A sensor by any other name").

You can dedicate a line card to the monitoring task or, somewhat equivalently, plug an enclosure-management module into a line card. Depending on how many LEDs, alarms, and other components you want to support, this approach can take up an entire card. Unfortunately, this option uses up a valuable slot in the enclosure. Alternatively, you can locate the enclosure-management intelligence on the active backplane. However, if the enclosure-management subsystem fails, you have to pull everything out of the backplane before you replace it. This process can be costly in downtime, which is why passive backplanes—basically, a bunch of connectors and no active or intelligent components to break down—tend to be more common.

In band, out of band

With intelligence spread throughout the enclosure and out to a host server, you need a reliable means for passing enclosure-management information around so that you can effectively use it. One school of thought suggests using any existing infrastructure as a datapath (in band). For example, the enclosure-management system in a Fibre Channel rack can pass information to a host server via the Fibre Channel connection. Such a scheme has the advantage of keeping down costs. Unfortunately, if the Fibre Channel connection fails, the link with the host server also fails. This situation results in the loss of access to information that can pinpoint a failure. Instead, a technician knows only that something on the rack has failed. Additionally, monitoring data that you need to pass in a crisis can impact data throughput.

The disadvantage of passing data out of band (that is, on a dedicated bus) is that it can be expensive to run another bus between the enclosure that you're monitoring and a host server, especially if the two are physically far away from each other. However, if you are designing a high-availability system, then you need to provide a redundant datapath for reliability anyway. In such cases, having both an in-band and an out-of-band datapath covers most of your bases. Note that unless you're also in control of the software running on the host server, you need to send data in a standard format, such as ESI (enclosure-services interface), SAF-TE (SCSI-access fault-tolerant enclosure, usually over SCSI), SES (usually over FC), or IPMI (Intelligent Platform-Management Interface) (see sidebar "Enclosure initiatives").

The intelligence hierarchy

If a sensor cannot handle an event, it passes an alert to a monitor. If a monitor cannot handle the event (speeding up the fan does not sufficiently reduce the temperature), it passes an alert up to the next layer of intelligence. In some cases, this next level is a human being who then decides how to take action. For larger systems, however, it makes sense to add another layer of management through middleware or support software. For example, a monitor might send an alert whenever someone opens the enclosure. Using middleware, the host server can determine whether a service technician has opened the correct box during servicing and flash LEDs if the technician has made a mistake and is about to pull out the wrong card.
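The escalation path can be sketched as a chain of handlers, each of which either deals with an event or lets it pass upward. The event fields and handler rules below are hypothetical and chosen only to mirror the examples in the text.

```python
def sensor_level(event: dict) -> bool:
    # A smart sensor only reports; it cannot correct anything by itself.
    return False


def monitor_level(event: dict) -> bool:
    # Hypothetical rule: the monitor copes if full fan speed brings the
    # temperature back under the limit.
    return event["kind"] == "temperature" and event["temp_at_full_fan_c"] < event["limit_c"]


def middleware_level(event: dict) -> bool:
    # Hypothetical rule: an opened enclosure is fine if it matches the
    # enclosure named on an open service ticket.
    return event["kind"] == "enclosure_opened" and event.get("service_ticket") == event["enclosure"]


LEVELS = [("sensor", sensor_level), ("monitor", monitor_level), ("middleware", middleware_level)]


def escalate(event: dict, levels=LEVELS) -> str:
    """Return the name of the first level that handles the event, else 'operator'."""
    for name, handler in levels:
        if handler(event):
            return name
    return "operator"
```

Anything that no automated level can resolve ends up with a human operator, which matches the hierarchy described above.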

Middleware also handles mapping out a system by taking care of issues such as node discovery or hot insertion. If you add a drive to a system, you have to monitor that drive as well. Without the appropriate software, you must remember to correctly configure a monitor to recognize the new drive. However, with the appropriate software, initializing monitoring of the new drive may be automated. This approach also enables a network administrator to view device environmental status with the same logical map he or she used to monitor device performance.

Given the simplicity of sensors, you have little room to differentiate a product at that level. However, significant room exists to differentiate how to collect, process, and act on that data. Enclosure-management support software and middleware play an important role in managing the monitoring system and aggregating multiple systems under one management umbrella. For example, statistics collected over time can reveal patterns of failure that you can address as they start to appear. Middleware also decides how to handle alerts and how, if necessary, to contact the right person, such as a network administrator, in an appropriate way. Additionally, middleware can hand over data to the application layer for display formatting suitable for you to access.

With enclosure-management support software, you must decide whether you want off-the-shelf support or a proprietary package. Proprietary software requires you to invest in creating and then supporting the middleware. One advantage is that you then have complete control over what you monitor and how you process that information; for example, you don't have to figure out elaborate ways of working around a protocol that doesn't allow you to do exactly what you want to do. You can also provide important features, such as automatic configuration, that off-the-shelf middleware doesn't support. If your software supports only your devices, this may be an advantage in that it forces your customers to buy only your products if they want comprehensive enclosure management. However, supporting your own devices may also lock you out of certain markets. One compromise is to support other vendors' equipment, but you should also be able to provide extra features when you work with your equipment. If you don't want to deal with creating middleware, then you need to design your overall enclosure-management scheme to support protocols and features that off-the-shelf middleware supports.

Who monitors the monitors?

For maximum reliability, a system needs a redundant enclosure-management subsystem. Such redundancy can extend to a level as low as primary and backup sensors. Alternatives to supporting primary and backup sensors include placing sensors so that they overlap in coverage or implementing a method for testing individual sensors on an automated and regular basis. At the highest level, redundancy demands independent monitoring loops, which include distinct datapath buses to a host server.

Most important, redundant enclosure-management subsystems monitor each other. Both subsystems are connected, usually over an I2C bus, and monitor each other through a shared "heartbeat." A heartbeat can be as simple as a pulse or polled request that each subsystem answers to let the other subsystem know that it is still active. If the heartbeat fails, the appropriate enclosure-management subsystem takes over monitoring the entire system and runs diagnostics on the failed loop to locate the problem.
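A heartbeat check reduces to remembering when the peer was last heard from and comparing that time against a timeout. The sketch below shows one subsystem's view of its peer; the class name, the timeout value, and the caller-supplied clock are assumptions made for this example.

```python
class HeartbeatWatch:
    """Tracks the peer subsystem's heartbeat; times are seconds supplied by the caller."""

    def __init__(self, timeout_s: float, now: float):
        self.timeout_s = timeout_s
        self.last_beat = now
        self.in_charge = False  # True once this subsystem has taken over

    def beat(self, now: float) -> None:
        """Record a heartbeat received from the peer."""
        self.last_beat = now

    def check(self, now: float) -> bool:
        """Return True if this subsystem should be monitoring the entire system."""
        if now - self.last_beat > self.timeout_s:
            self.in_charge = True
        return self.in_charge
```

In this sketch a takeover is sticky: once the peer has missed its heartbeat, the surviving subsystem stays in charge until someone resets it, which leaves room for diagnostics on the failed loop.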

To be truly robust, the enclosure-management subsystem must be able to survive a complete system failure. For example, the subsystem must have its own power supply; an independent means for sending alarms outside the system, such as through a direct phone connection because the system cannot transfer alerts; and a means for describing the problem and its location. For example, if the host server is down, middleware is unavailable to process data for human access; a Web server can be useful for such a contingency. Web servers are also useful because they allow a technician to directly plug into the enclosure and retrieve data rather than having to trek back and forth to another computer or interpret a confusing bank of flashing LEDs (see sidebar "Considerations for building enclosures").

Enclosure management is a tool that prevents costly or catastrophic failure. Some of the prevention measures may seem extreme, depending on the application; however, the consequences of downtime may make them absolutely necessary.


For more information...
When you contact any of the following manufacturers directly, please let them know you read about their products in EDN.
Agilent
1-650-752-5000
www.agilent.com
Analog Devices
1-781-329-4700
www.analogdevices.com
Mylex (a business unit of IBM)
1-510-796-6100
www.mylex.com
National Semiconductor
www.nsc.com
Pentair
1-401-732-3770
www.pentair-ep.com
Philips
1-800-234-7381
www.semiconductors.philips.com
QLogic
1-949-389-6000
www.qlogic.com
Summit Microelectronics
1-408-378-6461
www.summitmicro.com
Triscend
1-650-968-8668
www.triscend.com
Vitesse
1-805-388-3700
www.vitesse.com
Winbond Electronics Corp America
1-408-943-6666
www.winbond-usa.com
IPMI and ICMB resources
http://developer.intel.com/design/servers/ipmi/


Author Information
You can reach Technical Editor Nicholas Cravotta at 1-510-558-8906, fax 1-510-558-8914, e-mail ednnick@pacbell.net.


Acknowledgment
Special thanks to Mardy Marshall, director of Core Systems at WaterCove Networks (www.watercove.com), for his contributions to this article.

 

Considerations for building enclosures

Temperature sensors that trigger on a high threshold reduce the amount of monitoring a sensor requires; the sensor notifies a controller only when conditions are abnormal, thus freeing up the controller for other tasks. A sensor that also triggers on a low threshold can determine whether a fan is working harder than it needs to and prompt the controller to reduce the fan's duty cycle to increase the life of the fan.

You should always bias outputs that trigger interrupts active-high to conserve power because the output holds almost 100% of the time in the inactive state. This point is especially important for systems that fail over to a reliably redundant power supply, such as a battery.

One characteristic that differentiates high-end monitors is the number of pins the monitor has, which directly correlates with the cost of the device. The range of monitors supports a few to more than 50 general-purpose I/Os. However, you can increase the number of general-purpose I/Os by using a general-purpose-I/O expander. These devices give you as many as 64 additional general-purpose I/Os at the cost of an I2C bus to communicate with the expander.

It might be tempting to overmonitor a system, given the relatively low cost of sensors. However, for centralized monitoring, sensor information passes over pins on the backplane. You also need a pin for each LED you want to fire up or for each fan you want to control, as well as two for each I2C bus and any out-of-band buses you support. You may go through pins faster than you expect. One way to reduce pin count is to aggregate several sensors onto one I2C bus.

Bus capacitance, however, limits the number of devices you can place on an I2C bus, depending on how you lay out your boards, attach edge connectors, and arrange other features. Typically, you can run four to six devices without a problem. If you need more devices or the ability to expand your device in the future, you need to support multiple I2C buses, which you then bridge together before feeding them to a host controller.
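A quick budget calculation helps catch pin and bus shortages early. The sketch below uses the rules of thumb from the text (one pin per LED, one per fan, two per I2C bus, and four to six devices per bus); the default of four devices per bus is a conservative assumption, and the function names are invented here.

```python
import math


def buses_needed(devices: int, per_bus: int = 4) -> int:
    """Number of I2C buses needed if each bus carries at most per_bus devices."""
    if devices < 0 or per_bus < 1:
        raise ValueError("devices must be >= 0 and per_bus must be >= 1")
    return math.ceil(devices / per_bus)


def backplane_pins(leds: int, fans: int, i2c_buses: int, other_bus_pins: int = 0) -> int:
    """One pin per LED and per fan, two per I2C bus, plus any out-of-band bus pins."""
    return leds + fans + 2 * i2c_buses + other_bus_pins
```

For example, ten sensors at four per bus need three buses, and eight LEDs plus four fans plus those three buses already consume 18 backplane pins.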

A monitor doesn't necessarily need direct access to a phone line to call alerts over a modem. Some monitors support modem dial-up through a serial port, an intelligent chassis-management bus, an I2C, or a LAN, enabling one modem—or two if you want redundancy—to serve multiple monitors. To be useful, your bus has to survive failure conditions.

A real-time clock allows you to time-stamp events, which simplifies logging of events and enables middleware to accurately determine the duration of an event while alerts travel up the messaging hierarchy. You can use a local oscillator to source a clock, but then you need to include an occasional recalibration of the clock in your management scheme.
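As an illustration of why time stamps matter, the sketch below logs when a condition starts and clears and derives its duration from the two stamps, regardless of how long the alerts took to arrive. The class and event names are hypothetical.

```python
from datetime import datetime, timedelta


class EventLog:
    def __init__(self):
        self.open_events = {}   # event name -> time stamp of onset
        self.closed = []        # (name, onset, duration) tuples

    def raise_event(self, name: str, stamp: datetime) -> None:
        """Record the onset of a condition; repeated alerts keep the first stamp."""
        self.open_events.setdefault(name, stamp)

    def clear_event(self, name: str, stamp: datetime):
        """Close a condition and return its duration, or None if it was not open."""
        start = self.open_events.pop(name, None)
        if start is None:
            return None
        duration = stamp - start
        self.closed.append((name, start, duration))
        return duration
```

Because the stamps come from the monitor's clock rather than from the time of arrival at the host, clock drift translates directly into timing error, which is the reason for the occasional recalibration.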

If the monitor vendor doesn't provide software or drivers to support temperature sensors, fan tachometers, PWMs to control the fans, power-fault sensing, and other features, you have to write them. Although such software is simple, it includes many nuances, such as adjusting the speeds of the other fans to compensate for a dying fan until a technician replaces it, which you may discover only through the painful lessons of experience.
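One such nuance, compensating for a dying fan, might look like the sketch below. It makes the simplifying assumption that airflow scales with duty cycle, so the shortfall of the failing fan is spread evenly across the healthy ones and capped at full speed. Real fans and enclosures are not this linear, and the function name is invented for this example.

```python
def compensate(duties, failing: int, failing_duty: float = 0, max_duty: float = 100):
    """Return new duty cycles (percent) that make up for one failing fan.

    duties is a list of current duty cycles; failing is the index of the
    fan that is dying; failing_duty is what that fan can still deliver.
    """
    healthy = [i for i in range(len(duties)) if i != failing]
    if not healthy:
        raise ValueError("no healthy fans left to compensate")
    shortfall = max(duties[failing] - failing_duty, 0)
    extra = shortfall / len(healthy)
    result = list(duties)
    result[failing] = failing_duty
    for i in healthy:
        result[i] = min(duties[i] + extra, max_duty)
    return result
```

When the healthy fans hit the cap, the shortfall can no longer be covered, and that is a natural point at which to raise an alert rather than keep adjusting.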

A sensor by any other name

In a complex system, you may have access to environmental information that doesn't come directly from a sensor. For example, NEBS compliance requires that equipment can monitor and recover from an earthquake or a shock. (Note that the equipment doesn't have to run during an actual earthquake but must be able to resume operation afterward.) One method of recognizing a shock condition uses an accelerometer to detect motion. However, some shock-resistant hard drives protect themselves by shutting down when they detect shock, such as when the head flies off its tracks and the drive stops trying to read. By monitoring the hard drive and recognizing the way it shuts down and resumes operation, you can use the drive as a shock sensor. In this case, you can use the behavior of a system component to monitor a specific environmental condition without having to provide a separate sensor.

Enclosure initiatives

The IPMI (Intelligent Platform Management Interface) initiative describes how boards and components within a system or box can exchange status information. The initiative acknowledges a hierarchy of components and a datapath from a baseboard-management controller through an application-programming interface, allowing interaction with system hosts and management middleware. The baseboard-management controller can be a stand-alone controller, such as a monitor or a host processor, that handles enclosure-management tasks. Also part of the initiative is the ICMB (intelligent chassis-management bus), which provides an out-of-band datapath between baseboard-management controllers and host-management controllers. Note that you can use IPMI without ICMB and that you can use IPMI to bridge in-band data streams, such as those over Fibre Channel, I2C, and proprietary protocols.

IPMI helps with the management of large banks of servers, SANs, JBODs, and other devices by defining how messages are sent between components. Additionally, by providing a standard messaging protocol, equipment from multiple vendors can coexist in the same box, at least in terms of enclosure management. IPMI also defines methods for communicating with the outside world. Version 1.5, for example, supports communication via modem to remote servers for event logging.

Although PICMG has chosen IPMI for enclosure management for CompactPCI, few vendors have actually implemented IPMI. It's easier for some vendors to use existing proprietary protocols, and changing protocols offers few advantages. Other vendors actively differentiate themselves by using proprietary protocols so that they can provide advanced features and functions. However, many vendors simply are unaware of IPMI.

Acronyms

BMC: baseboard-management controller

ESI: enclosure-services interface

I2C: Inter-IC bus or two-wire serial

ICMB: intelligent-chassis-management bus

IPMI: intelligent-platform-management interface

JBOD: just a bunch of disks

NEBS: Network Equipment-Building System

PICMG: PCI Industrial Computer Manufacturers Group

SAF-TE: SCSI-access fault-tolerant enclosure

SAN: storage-area network

SES: SCSI enclosure services

SOC: system on chip

UART: universal asynchronous receiver-transmitter

Just can't get enough I/O

One of the difficulties with using a standard 8-bit microcontroller for enclosure management is the limited I/O and interfaces of many of these devices. For example, many 8-bit devices support one I2C bus but require the microcontroller to do a lot of the bit-banging in software. If you want to run two I2C buses for redundancy, you have to use two 8-bit devices and support a bridge between them with accompanying complexity in software. Some system configurations use a third I2C bus to handle memory devices. You may also want to logically separate sensors or run different bus trees because the I2C bus can choke if it carries too much traffic. On the I/O side, given the small package of 8-bit micros, not enough pins are available to handle as many inputs as you might like. Extending I/O channels through the use of another chip increases system complexity either through a chip specifically designed for increasing I/O access or through the use of an FPGA.

The E5 Configurable SOC family from Triscend addresses I/O and interface limitations by providing configurable-system-logic cells. Triscend based the family on an accelerated 40-MHz 8051 core running four clocks per instruction versus a standard of 12 clocks. Standard interfaces include a UART, a two-channel DMA, and a 32-bit address-memory-interface unit. You can use the configurable-system-logic cells to create accelerators to stretch the capabilities of the 8051, such as those required for multiplication or bit manipulation, but they are tailored mainly for extending external interfaces. A royalty-free library offers a variety of peripherals, including slave and master/slave I2C buses. Cell counts range from 256 to 2400 cells; for example, a multimaster I2C interface (one that allows other masters on the same I2C bus) requires around 300 cells. Prices for the E5 start at $7 (10,000).

Triscend also offers a version of the family with a 32-bit ARM microprocessor. You might consider this product line even if you don't need the extra processing power, depending on your team's relative familiarity with 8-bit code and ARM-programming tools. However, if you already use an 8051, then you probably can port over most of your existing code. In such a case, you may be able to reduce chip count and cost. One factor to consider with the Triscend part is that sensor and fan drivers are currently unavailable, so you have to write your own software.

 

Chassisity belt

Many companies have realized the hard way that the easiest way to steal data is for someone to walk into the server room and rip a hard drive right out of a rack. One method that protects against such an attack is intrusion management, an approach in which a sensor detects whether someone has opened the system enclosure and sends an alert whenever it detects that the enclosure is open. If someone is servicing the system, you can ignore the alert. Otherwise, the alert means someone who shouldn’t be interacting with the system, including a service technician opening the wrong enclosure, is interacting with it, giving you the chance to interrupt the intrusion.

Intrusion management can be as simple as a magnetic sensor that triggers when someone opens the enclosure. A drawback of sensing only the door is that someone can come from behind or from the side without triggering the sensor. Optical sensors are good for sensing intrusion, because they trigger whenever light strikes the sensor. One problem with optical sensors, however, is that as heat and power issues continue to increase, so does the number of venting holes that let in light.

For systems without enclosures, you can sense whether individual components are correctly in place. You can also aggregate all intrusion sensors onto a single interrupt, but doing so eliminates your ability to determine the exact location or subsystem that an intruder removes. In any case, if you decide to support intrusion management, make sure it has a fast path to a processor that can quickly take appropriate action.

Remote sensing

One example of a temperature sensor is the remote-diode sensor. A monitor senses temperature by forcing 1x and 10x currents through the diode and comparing the resulting diode voltages. As the temperature rises, the delta voltage increases.
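The relationship behind this technique is usually written as delta V = n(kT/q)ln(N), where N is the ratio of the two forced currents, T is the absolute temperature, and n is the diode's ideality factor. The sketch below evaluates it in both directions, assuming an ideal diode (n = 1) and a 10:1 current ratio; real monitors also correct for ideality and series resistance, which this example ignores.

```python
import math

BOLTZMANN = 1.380649e-23          # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C


def delta_v(temp_c: float, ratio: float = 10.0, ideality: float = 1.0) -> float:
    """Difference in diode voltage (volts) between the two forced currents."""
    kelvin = temp_c + 273.15
    return ideality * BOLTZMANN * kelvin / ELECTRON_CHARGE * math.log(ratio)


def temp_from_delta_v(dv: float, ratio: float = 10.0, ideality: float = 1.0) -> float:
    """Invert the relationship: measured delta voltage (volts) to degrees Celsius."""
    kelvin = dv * ELECTRON_CHARGE / (ideality * BOLTZMANN * math.log(ratio))
    return kelvin - 273.15
```

Under these assumptions the delta voltage is roughly 59 mV at 25°C and changes by only about 0.2 mV per degree, which helps explain why noise on the diode lines matters so much.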

Remote-diode sensors are nifty because you can easily put them anywhere on a board, usually near a known hot spot. In a notebook, for example, you might use three such diodes to monitor the main processor, the graphics chip, and the hard drive. Some devices, such as some of the Pentium processors, have embedded diodes for this purpose. Some monitors have integrated controllers for the diodes, so you have only to run a plus and minus line to the diode. However, you need to avoid noise injection over the diode traces. Thus, if you design your board and then consider adding enclosure management, you might find that even a simple operation, such as reading a temperature, becomes complicated.

The problems with getting in band

One issue with in-band data transfer is that the transceiver on the enclosure-management controller probably doesn’t run at the high bus speeds of the pipe it uses to send data. Using a lower speed transceiver keeps down the cost of the enclosure-management controller but requires that the controller run asynchronously over the bus. In such a case, a host mastering the bus would suspend data flow over the bus to attach to the controller, dump information, and then disconnect. Given that polling might occur as infrequently as every 10 seconds and that the monitor sends limited data, the actual impact on data throughput over the bus is minimal during typical operating conditions. Additionally, an integrated transceiver can monitor the data bus and collect statistics.
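A back-of-the-envelope estimate shows why the impact is small. The sketch below computes the fraction of bus time taken by one status dump per polling interval. The payload size, controller transceiver rate, and attach/detach time used in the example are illustrative guesses, not figures from any product.

```python
def inband_overhead(payload_bytes: int, controller_bps: float,
                    poll_interval_s: float, setup_s: float = 0.0) -> float:
    """Fraction of bus time spent on enclosure-management polling.

    setup_s is the (assumed) time to attach to and detach from the controller.
    """
    if controller_bps <= 0 or poll_interval_s <= 0:
        raise ValueError("rates and intervals must be positive")
    transfer_s = payload_bytes * 8 / controller_bps
    return (transfer_s + setup_s) / poll_interval_s


# Hypothetical case: 256 bytes over a 1-Mbps transceiver, polled every 10 seconds.
print(f"{inband_overhead(256, 1_000_000, 10.0) * 100:.4f}% of bus time")
```

The picture changes in a crisis, when alerts and diagnostics may need to move far more often than once every 10 seconds.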

With integrated transceivers, it is difficult for enclosure-management vendors to keep controllers updated as bus speeds increase—for example, as Fibre Channel moves from 1 to 2 Gbps. Instead, the enclosure-management controller can use a bus such as ESI. Some Fibre Channel drives support ESI, so the controller can plug into the drive and transfer data in band, using the Fibre Channel drive as a bridge. (The Fibre Channel drive buffers and inserts data from the ESI bus onto the Fibre Channel bus.) ESI is not as efficient a bus as an integrated transceiver, but it provides a means for transferring data in band over buses that enclosure-management controllers don’t yet support. One problem with this architecture, however, is that it requires a drive with ESI, a component you may be unable to guarantee will be in the final system. Additionally, you might ship the enclosure subsystem without drives, so you cannot configure or test the monitoring subsystem until the customer installs such a drive.



©1997-2009 Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.