Looking for drift in all the wrong places
Curt Wilson - January 10, 2013
As a young engineer in the late 1980s, I joined a small company that produced digital motion controllers. The digital technology had gotten capable enough and cheap enough that it could be used in many applications that had previously used purely mechanical controls or analog electronic controls. Furthermore, with advances in the cost-effectiveness of position sensors such as digital optical encoders, position control could be employed in applications that earlier had only used velocity control with analog tachometers.
In this small company, I had to be a jack of all trades, and one of the hats I wore was that of an applications support engineer, handling customer problems. We had recently gotten our controller designed into an application running a conveyorized assembly line that produced some of the first compact fluorescent light bulbs. The engineers who set up the line were very pleased with its operation, as it achieved accuracies and speeds that they had been unable to obtain before.
When the line was put into full production, however, a key problem appeared. As an automated line, it was intended for 24-hour, three-shift operation. But when it was employed in this continuous fashion, after working perfectly through the day and evening, it would invariably crash in the middle of the graveyard shift, at about 2 am. Since there was only a skeleton crew on that shift, the line would remain down for the rest of the night. When the engineers came in at about 8 am, they would reset everything, and production would restart. This became the daily pattern.
Of course, because the company was regularly losing a quarter of its potential production time, the nightly crash quickly became a high-profile problem, and resources were allocated to find the cause. The company assigned skilled and experienced technicians to monitor the system overnight and look for the source of the problem. Several said they had seen problems of this sort before and had usually traced them back to analog drift issues. The dominant theory at the outset was that the factory temperature went down overnight, and the drop put some analog circuits out of whack. (The amplifiers in this system were still analog, though driven by a digital controller.)
Many days went by in a fruitless search for a drift problem. Analog scope traces showed no drift in key signals before the crash. Even when the thermostats were set to maintain the same temperature overnight as during the day, a condition confirmed by independent measurements, the crash still occurred each night.
I was asked to see what information I could provide from the controller to get at the source of the problem. As a digital controller, it had data-logging capability that I thought could be useful.
This capability, though advanced for its day, was very primitive by today’s standards. For one thing, memory was very expensive then, so the controller could store less than a minute’s worth of data at the sample rates needed to see the problem. Furthermore, the function had been designed for short, discrete events, such as servo tuning moves. When the available memory buffer filled, it simply stopped.
To make this function useful for the problem at hand, I had to write some low-level code to spoof the buffer’s status bits and storage pointers. When the buffer was close to the end from logging the key motion registers, my code checked to see whether there were any error conditions in the motion algorithms. If not, I overwrote the storage pointer so it addressed the start of the buffer again. If an error had occurred, I simply stopped the gathering, so there was a record of about the last minute before the failure. Basically, I had created a black-box recorder for the system.
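The controller's actual firmware isn't shown in the article, but the wrap-and-freeze logging scheme described above can be sketched in C. The names, struct layout, and freeze-on-error behavior here are my own illustration, not the original code:

```c
#include <stdbool.h>

#define BUF_LEN 8  /* tiny for illustration; the real buffer held ~1 minute of samples */

/* Hypothetical sketch of the "black box" recorder: log continuously,
   wrapping the storage pointer back to the start of the buffer, and
   freeze the buffer the moment an error condition is seen so the last
   pre-failure samples are preserved. */
typedef struct {
    long samples[BUF_LEN];
    int  head;    /* next write position (the "storage pointer") */
    bool frozen;  /* set once an error is seen; logging then stops */
} BlackBox;

void blackbox_log(BlackBox *bb, long sample, bool error_flag) {
    if (bb->frozen)
        return;                           /* keep the pre-failure record intact */
    bb->samples[bb->head] = sample;
    bb->head = (bb->head + 1) % BUF_LEN;  /* wrap pointer to start of buffer */
    if (error_flag)
        bb->frozen = true;                /* stop gathering at the failure */
}
```

The key design choice, as in the original hack, is that wrapping makes the fixed-size buffer behave as a rolling window, so whenever the fault finally occurs, the buffer holds the minute leading up to it.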
We did a few “single shot” runs of the logging buffer while the system was working well to get some reference plots for comparison, then set it up to run continuously overnight. That first night, the failure occurred as usual, but now we had good data about the lead-up to the failure.
In the morning, we uploaded the logged data and plotted it, and it was immediately evident, in a head-slapping, “why didn’t we think of this before” way, what the source of the problem was.
The plots showed that the system was working perfectly right up until the instant of the failure. The failure occurred when the 32-bit position register for the conveyor reached its maximum positive value and rolled over to its maximum negative value. At the operational speed for the conveyor, which was always moving in the “positive” direction, this took about 18 hours from startup at zero position, which was why it always occurred in the middle of the night, and why it had never been seen in the shorter runs that proved out operation of the line.

Within a couple of minutes, we had ascertained that the math in the application software was not prepared to handle this rollover. In a few more minutes, we had figured out the required fix and applied it, and the problem never recurred.
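The timing is easy to check: a 32-bit signed register wraps from +2,147,483,647 to −2,147,483,648 after 2³¹ counts, and 2³¹ counts in roughly 18 hours works out to about 33,000 counts per second. The article doesn't show the actual fix, but one standard way to make position math rollover-safe, sketched here in C as an assumption rather than the original code, is to compute position differences in unsigned arithmetic, where wraparound is well defined:

```c
#include <stdint.h>

/* Rollover-safe position delta: casting to unsigned makes the
   subtraction well defined across the wrap point, and casting the
   result back to signed recovers the true increment as long as the
   real movement between samples is less than half the counter range. */
int32_t position_delta(int32_t new_pos, int32_t old_pos) {
    return (int32_t)((uint32_t)new_pos - (uint32_t)old_pos);
}
```

With deltas computed this way, the application can accumulate distance or compare positions without ever caring that the raw register wrapped at 2 am.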
Looking back at the problem, I see that the old hands (to whom I had initially deferred) were actually held back in finding the problem by their experience on different types of systems. They were very focused on the types of problems that occurred in analog velocity-control systems, not recognizing the potential for very different sources of problems in a digital positioning system.
Fortuitously, this experience also sensitized me to the “failure on rollover” software issue a dozen years before the Y2K scare, so I was well prepared to recognize what would and would not be a problem there.
Curt Wilson is vice president of engineering and research at Delta Tau Data Systems, an industrial controls company. He holds bachelor of science and master of science degrees in mechanical engineering from Stanford University.