Toyota's killer firmware: Bad design and its consequences
Thousands and thousands
The Camry ETCS code was found to contain some 11,000 global variables. Barr described the code as “spaghetti.” By the cyclomatic complexity metric, 67 functions scored above 50, the threshold generally considered untestable; the throttle-angle function scored above 100, considered unmaintainable.
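For readers unfamiliar with the metric: cyclomatic complexity is roughly the number of decision points in a function plus one, i.e., the number of independent paths through it. The toy function below (not ETCS code; the name and logic are invented for illustration) scores 4; a function scoring over 50 has so many paths that exhaustive branch testing becomes impractical.

```c
#include <assert.h>

/* Toy illustration of cyclomatic complexity (hypothetical code, not
 * from the ETCS). Complexity = decision points + 1. This function has
 * three decisions (two ifs plus the &&), so it scores 4. */
int clamp_throttle(int angle, int limit_active, int limit)
{
    if (angle < 0) {                     /* decision 1 */
        angle = 0;
    }
    if (limit_active && angle > limit) { /* decisions 2 and 3 */
        angle = limit;
    }
    return angle;
}
```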
Toyota loosely followed the widely adopted MISRA-C coding rules, but Barr’s group found 80,000 rule violations. Toyota's own internal standard adopts only 11 MISRA-C rules, and five of those were violated in the actual code. MISRA-C:1998, in effect when the code was originally written, has 93 required and 34 advisory rules. Toyota nailed six of them.
Barr also discovered inadequate and untracked peer code reviews and the absence of any bug-tracking system at Toyota.
NASA, which was involved in an earlier investigation, discussed in its report the five fail-safe modes implemented in the ETCS. They comprise three limp-home modes, RPM limiting, and finally, engine shutdown. All fail-safes are handled by the same task. What if that task dies or malfunctions?
Many embedded systems use watchdog timers to rein in errant processors; in safety-critical systems, they're mandatory. But as systems increase in complexity, the watchdog subsystem must mirror that complexity.
Ideally in a multitasking system, every active task should be required to "check in" to the watchdog. In the Toyota ETCS, the watchdog was satisfied by nothing more than a timer-tick interrupt service routine (ISR). A slow tick. If the ISR failed to reset the watchdog, the ETCS could continue to malfunction due to CPU overload for up to 1.5s before being reset. But keep in mind that for the great majority of task failure scenarios, the ISR would continue happily running along without resetting the controller.
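The task check-in idea can be sketched as follows. This is an illustrative pattern only, not the ETCS code: the task names, the `ALL_TASKS` mask, and `hw_watchdog_kick()` are hypothetical. Each monitored task sets its bit when it is demonstrably making progress, and a periodic supervisor kicks the hardware watchdog only when every bit is present, so one stuck task eventually forces a reset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical task IDs, one bit per monitored task. */
enum { TASK_THROTTLE = 1u << 0,
       TASK_CRUISE   = 1u << 1,
       TASK_DIAG     = 1u << 2 };
#define ALL_TASKS (TASK_THROTTLE | TASK_CRUISE | TASK_DIAG)

static volatile uint32_t checkin_mask;

/* Each task calls this from its main loop when it has real work done. */
void task_checkin(uint32_t task_bit) { checkin_mask |= task_bit; }

/* Stub standing in for a write to the watchdog's kick register. */
static void hw_watchdog_kick(void) { }

/* Called periodically (e.g., from the tick ISR). The hardware watchdog
 * is kicked only if EVERY monitored task checked in since the last kick;
 * otherwise the kick is withheld and the watchdog resets the CPU. */
int watchdog_service(void)
{
    if (checkin_mask == ALL_TASKS) {
        checkin_mask = 0;
        hw_watchdog_kick();
        return 1;               /* kicked */
    }
    return 0;                   /* withheld: a task is stuck */
}
```

Contrast this with kicking the watchdog from a bare tick ISR: the ISR keeps running even when application tasks are dead, so the watchdog learns nothing about their health.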
The investigation also found that most RTOS error codes indicating problems with tasks were simply ignored – itself a MISRA-C violation.
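The MISRA-friendly pattern is to test every status code an RTOS call returns and drive the system toward a fail-safe on error, never to discard it. A minimal sketch, with `os_status_t`, `os_task_create()`, and `enter_failsafe()` as hypothetical stand-ins for the unnamed RTOS API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical RTOS status type and task-creation call. */
typedef enum { OS_OK = 0, OS_ERR_NO_MEM, OS_ERR_BAD_PARAM } os_status_t;

static os_status_t os_task_create(const char *name)
{
    return (name != NULL) ? OS_OK : OS_ERR_BAD_PARAM;
}

static int failsafe_entered;
static void enter_failsafe(void) { failsafe_entered = 1; }

/* Every RTOS status code is checked; a failure is acted upon, not
 * silently dropped. */
int start_task_checked(const char *name)
{
    os_status_t st = os_task_create(name);
    if (st != OS_OK) {          /* do not discard the status code */
        enter_failsafe();
        return -1;
    }
    return 0;
}
```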
Who watches the watcher?
Toyota's ETCS board has a second processor to monitor the first. The monitor CPU is a third-party part running firmware unknown to Toyota, presumably developed without any detailed knowledge of the main CPU's code.
This is potentially a good thing, as it would be a truly independent overseer. This chip communicates with the main CPU over a serial link, and also contains the ADC that digitizes the accelerator pedal position.
Anyone working with safe systems knows that single points of failure are to be avoided at almost any cost, yet here is one – the single ADC that feeds both CPUs their vehicle state information.
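A standard mitigation is to digitize the pedal with two independent converters and cross-check the readings before trusting either. The sketch below is illustrative only; the tolerance value and the limp-home signaling are hypothetical, not taken from the actual ETCS.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical tolerance, in ADC counts, for how far two independently
 * digitized pedal signals may disagree before both are distrusted. */
#define MAX_DISAGREEMENT 40

/* Plausibility check over two independent pedal readings.
 * Returns the agreed pedal value, or -1 to signal "enter limp-home". */
int pedal_plausibility(int adc_a, int adc_b)
{
    if (abs(adc_a - adc_b) > MAX_DISAGREEMENT) {
        return -1;              /* sensors disagree: distrust both */
    }
    return (adc_a + adc_b) / 2; /* average the concordant readings */
}
```

With a single shared ADC, a fault in that one converter corrupts both CPUs' view of the pedal simultaneously, and no cross-check of this kind can catch it.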
Also, the failsafe code in this monitor CPU relies on the proper functioning of a main-CPU task that Barr identified to the jury only as "Task X" (due to secrecy rules surrounding the source code itself). Task X is an arguably outsize task, handling everything from cruise control to diagnostics to failsafes to the core function of converting pedal position to throttle angle – and it could be viewed as another single point of failure.
What can be learned from this story of software gone wrong? Here are some thoughts, inspired by Toyota's experience:
- It all starts with the engineering culture. If you have to fight to implement quality, or conversely, if others let you get away with shoddy work, quality cannot flourish. The culture must support proper peer review, documented rule enforcement, use of code-quality tools and metrics, etc.
- In complex systems, it's impossible to test all potential hardware- and software-induced failure scenarios. We must strive to implement all possible best practices, and use all the tools at our disposal, to create code that is failure-resistant by design.
- Use model-based design where suitable.
- Use tools with the proper credentials, not an uncertified RTOS as was done here.
- The system must undergo thorough testing by a separate engineering team. Never make the mistake of testing your own design. (To be fair, Toyota's overall test strategy was not specifically described.)
- The underlying hardware must work with the firmware to support reliability goals:
  - Single points of failure, in HW and SW, must be avoided.
  - Architectural techniques that contribute to reliability, such as lockstep CPUs, EDAC memory, properly implemented watchdogs, and an MMU providing full task isolation and protection, must be implemented.
  - A thorough FMEA (failure mode and effects analysis) should be employed to characterize failure modes and guide design improvements.
Are you involved with safety-critical devices? If so, are you satisfied with the quality processes and culture at your company? What are your thoughts on Toyota’s design work and the investigation’s findings?
Below is an interview with Michael Barr after his EE Live! keynote "Killer Apps: Embedded Software's Greatest Hit Jobs".
- The mythical software engineer
- Toyota Underestimated 'Deadly' Risks, EE Live! keynoter says
- Toyota Case: Single Bit Flip That Killed
- Why Toyota’s Oklahoma Case Is Different
- Acceleration Case: Jury Finds Toyota Liable
- Toyota, drive by wire, and our failure to learn from experience
- Toyota learns the tyranny of software complexity
- Toyota fined for accelerator pedal sticking, April 5, 2010
- Toyota accelerations revisited—hanging by a (tin) whisker
- Unintended acceleration and other embedded software bugs
- Firmware forensics: best practices in embedded software source-code discovery
- Dead code, the law, and unintended consequences
- Unintended acceleration
- Tin whisker analysis of an automotive engine control unit (published study)
- NASA Tin Whisker page
- Electrical Failure of an Accelerator Pedal Position Sensor Caused by a Tin Whisker and Discussion of Investigative Techniques Used for Whisker Detection
- How "brake override" stops runaway cars (Consumer Reports video)