Design Features, March 3, 1997
Karl H Pflueger, IBM Germany
The true path to high reliability is a well-managed process that starts in the product's definition phase and continues long after the first shipment.
The design of highly reliable power systems is an ongoing challenge. Power-supply reliability does not happen by chance, and even the choice of the proper topology and adequate deratings provides no guarantees. You improve reliability by attacking the real detractors at all phases of product development.
Specifically, implementing an aggressive design-for-reliability philosophy eliminates the shipment of many potential problems and increases the shipped products' reliability. This path to improving reliability is not simply derating or using more expensive components, but a process that starts with the definition of the product and continues through manufacturing and field use. Rather than trying to "test in" reliability, you need to aggressively challenge all areas in an attempt to maximize reliability. Use experience from past problems, and try to use solid engineering judgment to avoid new problems (see box, "The essential steps for high reliability").
Not surprisingly, such a well-managed process takes cooperation that transcends job titles, departments, and companies. Communication and close cooperation in the beginning stages are important to reliability. A joint effort among development, evaluation, and production personnel with the expressed goal of reliability can often lead to trade-offs that yield a more reliable product.
Performance prediction has its perils
Modern switch-mode power supplies incorporate a wide component-technology mix and, hence, have many potential failure mechanisms. Specifying a reliable power supply is rather easy, but predicting the actual performance is another matter. Traditional failure-rate prediction methods, such as MIL-217, yield pessimistic numbers. It is well-known, however, that a power supply's actual reliability can be orders of magnitude better than these paper results.
Traditional failure-rate improvement efforts concentrate on derating parts or using higher quality parts. These efforts stem from the belief that the intrinsic failure rate (IFR) is the fundamental problem. For the exponential model of failures, the IFR is the sum of the individual failure rates. Thus, reducing the individual rates reduces the overall failure rate. However, although derating and component quality can influence reliability, traditional methods of derating are useless unless you also address other factors.
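The summation rule can be sketched in a few lines of Python. The part names and failure rates below are hypothetical placeholders, not handbook values or IBM data. The only thing taken from the text is that, under the exponential model, the supply's failure rate is the sum of the individual part rates.

```python
# Hypothetical per-part failure rates in failures per million hours.
# These numbers are illustrative assumptions only.
PART_RATES_PER_1E6_H = {
    "switching transistor": 0.5,
    "output rectifier": 0.25,
    "electrolytic capacitor": 1.0,
    "control IC": 0.25,
}


def series_failure_rate(rates):
    """Sum the individual constant failure rates (exponential model)."""
    return sum(rates.values())


def mtbf_hours(rate_per_1e6_h):
    """MTBF is the reciprocal of the constant failure rate."""
    return 1e6 / rate_per_1e6_h


total = series_failure_rate(PART_RATES_PER_1E6_H)
print(f"total rate: {total} failures per million hours")
print(f"MTBF: {mtbf_hours(total):,.0f} hours")
```

Lowering any single entry lowers the total, which is the reasoning behind derating. The sum says nothing about failures that come from the process, the packaging, or the specification.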
Manufacturers and users perform extensive testing to find failure mechanisms and establish confidence in the ability of a supply to meet a reliability objective. Some people assume that passing rigorous stress tests ensures reliability, but such efforts are only part of the reliability picture. Many experienced power engineers can cite problems that escaped designer, evaluation, and manufacturing personnel. These problems are not necessarily the result of using the wrong topology or inadequate derating. Thus, a candid look at where and why failures occur is warranted, and a general process for achieving high reliability is extremely useful.
First, be clear on terminology
Confusion sometimes exists about reliability terminology, specifically about the difference between "failure rate" or "mean time between failures" (MTBF) on the one hand and useful or operational lifetime on the other. A common definition of "reliability" is the ability of an item to perform a required function under stated conditions for a stated period of time. Electronic systems, such as power systems, are designed to operate for a specified period of time at a stated failure rate. Together, these two parameters--operational lifetime and failure rate during that lifetime--determine the reliability of a product.
For a large sample, which applies mostly for power supplies, you can also define reliability as a function of failure rate over lifetime. You can make a graph of this function, which results in a curve with the shape of a bathtub (Figure 1). This curve consists of an early-life or infant-mortality period, in which the failure rate is decreasing; a steady-state period, in which the failure rate is constant; and a wearout period, in which the failure rate is increasing.
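One common way to picture the three periods is to add a decreasing hazard, a constant hazard, and an increasing hazard. The sketch below uses Weibull hazards for the early-life and wearout terms. Every shape, scale, and rate value is an assumption chosen only to produce the bathtub shape, not data from Figure 1.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape-1), t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def bathtub_hazard(t):
    """Sum of three hypothetical mechanisms, in failures per hour."""
    early = weibull_hazard(t, shape=0.5, scale=2.0e6)    # shape < 1: decreasing
    steady = 2.0e-6                                       # constant rate
    wearout = weibull_hazard(t, shape=5.0, scale=1.5e5)  # shape > 1: increasing
    return early + steady + wearout


for hours in (10, 100, 1_000, 10_000, 50_000, 100_000, 150_000):
    print(f"{hours:>7} h: {bathtub_hazard(hours):.2e} failures/hour")
```

With these assumed parameters the printed rate falls over the first few thousand hours, bottoms out, and then climbs as the wearout term takes over.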
Once you realize that this bathtub curve applies to most products, you can easily establish some ground rules to achieve highly reliable products. Some common-sense ground rules include cutting the early-life period before product shipment, ensuring that the failure rate during useful life or operating time is as low as possible, and ensuring that the useful-life period never extends into the wearout period.
Thus, a truly reliable product shows neither any early-life nor any wearout behavior, and the failure rate is constant and low during the useful life. The more common term, MTBF, is the reciprocal of the failure rate. With a constant failure rate, the MTBF also is constant. To clearly define the reliability objective for a product, you need to specify MTBF and useful life (both in hours).
At first glance, the fact that the MTBF is larger than the useful lifetime can be confusing. For example, a product's MTBF can be 500,000 hours; its useful life, 50,000 hours. These numbers mean that during the 50,000 hours of useful lifetime, this product is performing with an MTBF of 500,000 hours, which equals a failure rate of 0.2%/1000 hours of operation. Also, these numbers mean that, during that useful life period of 50,000 hours, approximately 10% (0.2%/1000 hours times 50,000 hours) of the total sample size eventually fails.
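The arithmetic in this example can be checked with a short Python sketch. The function names are invented for illustration. The last function assumes the constant-failure-rate (exponential) model and shows why the 10% figure is an approximation.

```python
import math


def failure_rate_pct_per_1000h(mtbf_h):
    """Convert MTBF in hours to a failure rate in percent per 1000 hours."""
    return 100.0 * 1000.0 / mtbf_h


def fraction_failed_linear(mtbf_h, life_h):
    """Rate times time: the back-of-envelope estimate used in the text."""
    return life_h / mtbf_h


def fraction_failed_exponential(mtbf_h, life_h):
    """Value for a constant failure rate: 1 - exp(-t / MTBF)."""
    return 1.0 - math.exp(-life_h / mtbf_h)


mtbf, life = 500_000.0, 50_000.0
print(f"rate: {failure_rate_pct_per_1000h(mtbf):.2f} %/1000 h")
print(f"failed (linear estimate): {fraction_failed_linear(mtbf, life):.1%}")
print(f"failed (exponential model): {fraction_failed_exponential(mtbf, life):.1%}")
```

The exponential model gives about 9.5% rather than exactly 10%. The two estimates stay close as long as the useful life is short compared with the MTBF.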
Start at the beginning
The power-supply specification dictates many factors that ultimately determine the cost and reliability of the power supply. Too often, the specification disagrees with the actual requirements or includes features that are inessential. These items can degrade reliability (and increase cost) without adding to the real value or function of the supply. Estimating minimum and maximum load currents can be the most important factor for cost and reliability of the power supply. Inflated requirements can result from a lack of product definition, poor load specification from IC manufacturers, or added "fat" to cover unforeseen problems.
Aggressive and accurate specification reviews in the beginning of reliability testing, however, can reduce component cost and, theoretically, increase reliability. Constructive specification reviews with product designers, power-supply designers, evaluation engineers, and the manufacturing division can aid in ensuring that the power-supply specification reflects the optimum requirements necessary.
Another area designers frequently overlook when beginning is specification of the electromechanical aspects. Air-moving devices, mechanical switches, covers, and connectors strongly influence reliability. Experiences in IBM Germany's Power Systems group show that, for many products, the packaging and electromechanical concerns are the gating factors on system reliability. You should give adequate early consideration to choosing proven packaging, cooling, and connector systems.
The following list includes practical and common-sense hints. Many field problems arise because of a lack of attention in one of these areas. Of course, sound design and deratings are assumed.
By design, offline switchers should be able to tolerate large deviations in line voltage without damage. A supply designed to meet specified limits (90 to 137V ac, for example) is usually insufficient. The power supply is most likely subject to severe transients that can easily damage a supply without an appropriate safety margin. Decoupling and protecting ICs is critical. The power supply combines large-energy-level, noisy circuits with sensitive op amps and comparators. Common problems are negatively biased gates and noise glitches.
Specifying quality parts and limiting component sources are essential prerequisites for highly reliable parts. Also, do not depend on unspecified or variable-device parameters for reliability. The equivalent series resistance of capacitors is an example of a parameter that varies with time and temperature and that can affect loop gain and ripple voltage.
Follow power-supply evaluation
Many OEM companies have independent groups to evaluate power-supply performance. The power supply must meet myriad functional, safety, and reliability requirements. Failure to detect problems at this stage can lead to expensive consequences.
A typical power supply contains numerous potential failure mechanisms. Although there is no substitute for attention to detail in the design, further testing is most often warranted to ensure that latent failure modes do not exist in the design or, equally important, in the process. IBM's strategy is to willingly exceed specified limits to determine design margins and to induce failures. Inducing failures helps to find weak or susceptible areas of the design. Engineering judgment is always necessary to determine if a valid impact to reliability exists.
IBM uses the following tests with high success. The process is intended not as a rigorous, analytical one, but as a general model for power-supply evaluation. You must analyze failures and use sound engineering when determining if a failure constitutes a reliability problem. The problem is that you can't allow each new power supply to be "field tested" for reliability. You have to identify potential weak areas and correct them before the product ships. Satisfactory completion of the tests demonstrates a rugged design.
Thermal shock/cycling--Performing these tests to accelerate mechanical and electrical failures frequently uncovers material incompatibility or differences in expansion coefficients.
Temperature stress test--Step the power supply in 5° increments until its temperature is 10° above the stated life-test temperature. Subsequent "soaking" at this temperature for 24 hours helps to eliminate any major thermal problems and ensures a margin for life test.
Vibration--Vibrate samples on two axes to expose weaknesses in connectors, components, and solder joints. Testing under power-on conditions helps to catch intermittent problems. Solder joints that provide electrical and mechanical support for large components can be insufficient for reliable operation. This test is effective on the same sample used for thermal shock/cycling.
EMI--EMI can cause serious reliability problems. The power supply can act as a source of system noise, or it can be susceptible to EMI that other products generate. As system clocks and power switching frequencies increase to megahertz, you must give special consideration to EMI. EMI-related problems are often intermittent and may not show up on the bench. To minimize the risk of EMI problems, you must perform adequate tests at the power-supply level and in the final product's environment.
Thermal imaging--A thermal-imaging camera system provides a good overall thermal profile of the supply and can identify layout problems or overstressed components. The new cameras are easy to use, and experience shows that they correlate well with thermocouples. The thermal image allows you to evaluate the effectiveness of cooling and to avoid "shadowing" effects from heat sinks and transformers. A thermal-imaging system can detect problems caused by an overlooked snubber resistor or poor heat-sink mounting and can detect potential layout concerns.
Stress analysis--Measure voltage, current, and temperature stress under worst-case conditions. Although these tests can be time-consuming on a complex supply, automated data-collection systems considerably speed matters. This approach ensures that every component is analyzed at least once in its lifetime. This test can find the component that a designer added for snubbing or clamping and that was never actually measured.
Power-line disturbances--Apply power-line surges and drops outside specified limits to the input. The front end of an offline switcher is often subject to a hostile real world. The supply should operate without failure, even with significant deviations from nominal conditions. Utility companies' power-factor-correction circuitry can induce high-amplitude ringing on the line. In addition, lightning strikes can generate short but high-amplitude spikes.
Lifetime evaluation--Lifetime is a critical issue that is frequently overlooked but that you need to address during design. Electrolytic capacitors, on/off switches, blowers, and ICs have finite lifetimes. Electrolytic capacitors and electromechanical devices are common culprits.
Test to specification--Testing a supply under all operating conditions is fundamental. Increasing the sample sizes to ensure that component tolerances cause no problems is important. The use of automated test systems to collect data, such as ripple, hit voltages, and overcurrent trip levels, can shorten test time but still allow for testing of greater sample sizes. Many times, manufacturers obtain data on limited samples and mistakenly assume the same data holds for the general population.
Testing for reliability
It is interesting to note that as power-supply complexity is increasing, the reliability targets are sharply increasing, too. Ten years ago, a typical power supply consisted of a ferroresonant transformer and filter. Such a supply used only a few parts and had a specified failure rate of 1%/1000 hours. Assuming that the exponential model applies, this value translates to a 100,000-hour MTBF.
Now, a comparable product may consist of 250 parts, but the reliability targets are perhaps 0.1%/1000 hours, or 1 million hours MTBF. Paper-calculation techniques never support the demanded failure rates. However, most people would agree that a well-designed power supply can achieve much better results than MIL-217 or other techniques predict.
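A quick calculation shows how tight the newer targets are. The sketch below converts the two quoted failure rates to MTBF and then spreads the 0.1%/1000-hour target across 250 parts. Dividing the budget equally among the parts is a simplifying assumption of this example, since real parts contribute very unequally. FIT here means failures per billion hours.

```python
def pct_per_1000h_to_per_hour(rate_pct_per_1000h):
    """Convert percent per 1000 hours to failures per hour."""
    return rate_pct_per_1000h / 100.0 / 1000.0


def per_part_budget_fit(target_pct_per_1000h, part_count):
    """Average per-part allowance in FIT (failures per 1e9 hours),
    assuming all parts share the budget equally."""
    per_hour = pct_per_1000h_to_per_hour(target_pct_per_1000h)
    return per_hour * 1e9 / part_count


for label, rate in (("1%/1000 h", 1.0), ("0.1%/1000 h", 0.1)):
    mtbf = 1.0 / pct_per_1000h_to_per_hour(rate)
    print(f"{label}: MTBF of about {mtbf:,.0f} hours")

print(f"budget per part: {per_part_budget_fit(0.1, 250):.1f} FIT")
```

Under that equal-share assumption, each of the 250 parts is allowed only about 4 failures per billion hours on average.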
IBM's standard practice is to test a statistically relevant sample size (normally 25 to 50 pieces) for two to four months at an increased ambient temperature of 65°C and with a nominal load. The actual conditions vary, depending on the needs of the customer. These samples have already been subjected to extensive evaluation, so testers expect no failures. In addition to MTBF verification, you also want to avoid wearout of the supply. High temperature is an excellent catalyst for many wearout mechanisms that could limit the useful life.
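Counting device-hours gives a rough feel for what a failure-free test of this size can demonstrate. The sketch below uses the standard zero-failure bound for a constant failure rate, MTBF >= T / (-ln(1 - CL)). The 60% confidence level, the 730 hours per month, and the omission of any temperature acceleration factor are assumptions of this example, not a description of IBM's method.

```python
import math

HOURS_PER_MONTH = 730.0  # assumption: continuous operation, average month


def device_hours(units, months):
    """Total accumulated operating hours across all samples."""
    return units * months * HOURS_PER_MONTH


def mtbf_lower_bound(total_hours, confidence):
    """Lower confidence bound on MTBF after a test with zero failures,
    assuming a constant failure rate (exponential model)."""
    return total_hours / -math.log(1.0 - confidence)


for units, months in ((25, 2), (50, 4)):
    hours = device_hours(units, months)
    bound = mtbf_lower_bound(hours, 0.60)
    print(f"{units} units x {months} months: {hours:,.0f} device-hours, "
          f"MTBF lower bound of about {bound:,.0f} hours")
```

Any acceleration from the elevated temperature would scale the device-hours. The factor depends on the failure mechanism, so it is deliberately left out here.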
Some of the problems you can find during reliability testing (sometimes referred to as "life test") are poor solder joints, overstressed components, fan failures, connector problems, internal-component bonding, material incompatibility, and capacitor wearout.
Some of these problems are by no means exotic or technically challenging. However, they can still cause failures. Furthermore, each can contribute to poor performance in the field. The customer does not care if a problem is caused by a simple loose connector or some second-breakdown phenomenon. The result is downtime. As a result, attention to details is critical to realize high reliability. However, successful completion of the life test is one more step on the way to product reliability.
Maintain reliability in production
You could argue that the gating factor of a well-designed power supply's reliability is the quality of production. Manufacturing indeed contributes to most field problems. Some production problems are poor solder joints caused by solder-process changes or assembly errors, substitution of unapproved components, poor cleaning techniques, reversed polarized capacitors, and shipping damage.
Multiple sourcing of components continues to cause problems. Most power-supply vendors establish several component sources for cost and availability. Second-order effects, such as recovery times and leakages, and quality concerns often show up in the field. Limiting sources and ensuring adequate component qualification minimize exposure to these problems. None of the above problems is complex, and all are readily correctable. However, selecting a vendor with established and quality production is fundamental to long-term reliability.
The essential steps for high reliability
Reliability can't be solely designed-in, tested-in, or built-in. It takes a team effort, starting with the definition and specification of the product, and doesn't end with the first-customer shipment. The following points are key to designing, producing, and shipping a high-reliability power supply:
Author's biography
Karl H Pflueger heads the reliability lab at IBM Germany's Power Systems Division (Mainz, Germany). His team is responsible for the qualification of power supplies with an emphasis on reliability evaluation. He is also program manager for OEM power projects.