Intel researchers suggest new approach to error detection in software execution
The problem of detecting when execution of code is not following the path you had in mind has been an issue since the dawn of the electronic computer. In the really good old days—please don’t ask me why I know this first-hand—the Burroughs 205 computer, a mighty colossus with rotating drum memory, vacuum-tube logic and 9-track tape for on-line storage, had a CRT on the operator’s console. The address bus for the drum memory was split into high- and low-order almost-bytes, and each byte was fed through a DAC to the deflection amplifier on an axis of the CRT. The result was a striking visual display of the sequence of addresses fetched by the executing program—a trace, drawn out in two dimensions. Skilled operators could watch the CRT and tell at once if a familiar program—often the Algol compiler was the culprit—had wandered into never-never land.
The problem is somewhat more challenging for today’s multiprocessor server chips. With clocks in the GHz instead of the kHz, and with the potential for several processors working simultaneously on closely coupled threads, just discovering that something has gone wrong can be a serious undertaking, even in retrospect. Finding out in time to prevent damage is even harder.
And the stakes are getting higher. In network servers, for example, a deviation from expected execution most likely means that an attack by an intruder has succeeded, and that the next thing to happen will be the planting of a Trojan horse or some other pernicious act. Prompt response is vital.
In the near future, the problem will be just as critical in the embedded world, as multiprocessing begins to replace dedicated hardware accelerators in mission-critical applications such as engine management, vehicle safety and attitude control, and robotics. (At this point I may want to stop driving, but that’s another story.) Here, a deviation from the expected trajectory of the software is likely due either to a sensor out of limits or to a bug. In either case, the result is likely to be an undesired trajectory for a very fast, heavy piece of equipment in close proximity to humans. Again, early intervention is vital.
Such thoughts have led two researchers at Intel, software engineer Michael Ryan and research scientist Shimin Chen, to suggest a novel way of detecting these excursions. The two described and demonstrated their concept at the Intel Developer Forum last week.
The idea is not unlike (again reaching into uncomfortably distant history) what in-circuit emulators used to do in the early days of microprocessors. Ryan and Chen propose inserting a hardware recording and monitoring circuit into the instruction retirement engine of a CPU: the point at which the instruction’s decoded op-code and effective addresses are most readily available. The unit would record each op-code and address, compress this data using an on-the-fly lossless compression algorithm, and transmit it to a FIFO-like execution-log memory.
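For readers who think better in code, here is a minimal software sketch of that record, compress and queue pipeline. Everything in it is an assumption made for illustration: the researchers did not say which lossless scheme they use, so simple delta encoding of addresses stands in for it, and the names `compress`, `decompress` and `ExecutionLog` are hypothetical. The real proposal is a hardware unit, not Python.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

# One retired instruction: (op-code, effective address).
# Both are plain integers here; the real encodings are not described in the article.
Record = Tuple[int, int]


def compress(records: Iterable[Record]) -> Iterator[Record]:
    """Delta-encode addresses (an assumed stand-in for the unnamed lossless scheme)."""
    prev = 0
    for opcode, addr in records:
        yield opcode, addr - prev  # sequential code yields small, repetitive deltas
        prev = addr


def decompress(deltas: Iterable[Record]) -> Iterator[Record]:
    """Invert compress(): rebuild absolute addresses from the deltas."""
    prev = 0
    for opcode, delta in deltas:
        prev += delta
        yield opcode, prev


class ExecutionLog:
    """Bounded FIFO standing in for the execution-log memory (hypothetical)."""

    def __init__(self, capacity: int) -> None:
        self._fifo: Deque[Record] = deque()
        self._capacity = capacity

    def push(self, entry: Record) -> bool:
        """Append an entry; return False if the log is full and the entry was not stored."""
        if len(self._fifo) >= self._capacity:
            return False
        self._fifo.append(entry)
        return True

    def pop(self) -> Record:
        """Remove and return the oldest entry; raises IndexError if the log is empty."""
        return self._fifo.popleft()

    def __len__(self) -> int:
        return len(self._fifo)


# Usage: four retired instructions, the last one a jump to a distant address.
retired = [(0x01, 0x1000), (0x02, 0x1004), (0x03, 0x1008), (0x10, 0x2000)]
log = ExecutionLog(capacity=8)
for entry in compress(retired):
    log.push(entry)
```

What a full log should do is the interesting design question the sketch leaves open: stalling the monitored CPU preserves the trace but costs performance, while dropping entries does the opposite.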
A task on a second CPU then traverses the log, perhaps with its own hardware assistance, and uses something not unlike a routing table to determine whether execution is remaining within expected limits. Ideally, this inspector would be able to continuously test the log contents against a set of assertions designed to detect errors before they could have consequences. If something does go wrong, the second CPU can use the log information to in effect rewind the failed task to before the error, repair damage and start over.
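The inspector's job can be sketched in a few lines as well. This is only an illustration under stated assumptions: the "routing table" is modeled as a map from each instruction address to the set of addresses allowed to follow it, checkpoints are modeled as log indices at which a consistent state was saved, and the names `ALLOWED_NEXT`, `first_violation` and `rewind_point` are invented here, not taken from the researchers' work.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

Record = Tuple[int, int]  # (op-code, address), both hypothetical integer encodings

# Assumed "routing table": each address maps to its permitted successor addresses.
ALLOWED_NEXT: Dict[int, Set[int]] = {
    0x1000: {0x1004},
    0x1004: {0x1008, 0x1010},  # a conditional branch has two legal successors
    0x1008: {0x1010},
    0x1010: set(),             # end of the monitored region
}


def first_violation(log: Sequence[Record],
                    allowed: Dict[int, Set[int]]) -> Optional[int]:
    """Return the index of the first record that is not a permitted successor
    of the record before it, or None if the whole log stays within the table."""
    for i in range(1, len(log)):
        prev_addr = log[i - 1][1]
        addr = log[i][1]
        if addr not in allowed.get(prev_addr, set()):
            return i
    return None


def rewind_point(violation_index: int, checkpoints: List[int]) -> int:
    """Pick the latest checkpoint strictly before the violation, or 0 if none exists."""
    earlier = [c for c in checkpoints if c < violation_index]
    return max(earlier) if earlier else 0


# Usage: the third record jumps to an address the table does not allow.
trace = [(0x01, 0x1000), (0x02, 0x1004), (0x09, 0x4000), (0x03, 0x1010)]
bad_index = first_violation(trace, ALLOWED_NEXT)
if bad_index is not None:
    restart_from = rewind_point(bad_index, checkpoints=[0, 1, 3])
```

A successor table like this catches only control-flow excursions. The assertions the researchers have in mind could presumably also cover data addresses, which this sketch ignores.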
At this point the work is being done only in simulation, and there has not been a great deal of work on the implications for hard real-time systems, just servers. But the researchers are already doing feasibility studies on adding the necessary hardware to an Intel CPU and trying it out in reality. Given the growing importance of the problem, it might be a very good use of a little silicon real estate.