March 2, 1998

Alienating software bugs

The success of the Pathfinder mission to Mars was no accident. It resulted from careful planning, design, and execution by a large and diverse team of engineers, technicians, and operators. On the other hand, several problems could have disrupted the mission and limited Pathfinder and Sojourner's usefulness. A recent issue of the Embedded Muse newsletter by EDN columnist and embedded-engineer extraordinaire Jack Ganssle highlighted a Usenet exchange that explained some of the software problems, which stemmed from priority inversion on real-time systems, and how they were debugged and solved. Though some of you may already have seen or heard about this situation, the lessons it contains are worth repeating.

Pathfinder was beset by a bug that caused occasional total system resets, resulting in a loss of the meteorological data the craft was collecting from the Martian surface. Fortunately, thanks to a "fly-what-you-test-and-test-what-you-fly" philosophy, according to Glenn Reeves, a JPL software engineer, Pathfinder's software still contained several debugging features, even though the data they produced were too voluminous to send back to Earth. Having been bitten by the bug, the JPL engineers went into the lab and were able to reproduce the failure within about 18 hours without changing any software.

The problem involved an infrequent, low-priority data-collection task that launched a high-priority bus-management task to publish its data; a mutual-exclusion lock shut out other bus accesses for the duration of that task. Interrupts could schedule additional bus tasks behind the high-priority one. Although this approach works well most of the time, an interrupt arriving at just the wrong moment in the sequence could sneak a medium-priority but long-running communications task onto the schedule while the high-priority bus task was waiting for the low-priority data-collection one.
When this scenario happens, the communications task pre-empts the data-collection task, and the bus-management task stays blocked. Eventually a watchdog timer expires and, on the theory that something is drastically wrong, resets the entire system.

A JPL engineer found the problem in the lab early one morning using a trace/logging mode in VxWorks, the Pathfinder's OS. Once the problem was understood, the fix became obvious, though its impact was carefully and fully studied. Specialized software on the spacecraft allowed the engineers to upload a software patch that enabled priority inheritance: the low-priority data-collection task temporarily inherits the high priority of the bus-management task waiting on it, so the medium-priority communications task can no longer pre-empt it.

Although this priority-inversion problem had reared its head once before the launch, JPL's engineers couldn't get it to recur and simply ran out of time to debug it. It occurred with higher frequency on Mars because the equipment performed far better than even JPL's best-case expectations, allowing the engineers to load the equipment more heavily than they had during testing.

The lessons of this exercise are many. You need to be sure you understand, in detail, how your design works, even at performance extremes. More important, because you can't possibly know or consider every scenario, make sure you build hooks or features into your software that will help you upgrade it when problems do occur.
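
For readers who want to see the priority-inversion sequence play out, here is a minimal sketch in Python. It is a toy single-CPU, tick-based scheduler and not the Pathfinder flight software or VxWorks code. The task names echo the article, but the priorities, release times, run lengths, and the watchdog limit are all made-up values chosen only to make the effect visible. The `inherit` flag is a hypothetical switch that stands in for enabling priority inheritance on the lock.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Task:
    name: str
    priority: int     # larger number = more urgent
    release: int      # tick at which the task becomes ready
    work: int         # ticks of CPU time still needed
    needs_lock: bool  # True if the task runs entirely under the bus lock


def simulate(inherit: bool, watchdog_limit: int = 6, max_ticks: int = 30) -> str:
    """Run the toy schedule; all numbers here are illustrative assumptions."""
    tasks = [
        Task("data_collection", 1, 0, 3, True),
        Task("bus_management", 3, 1, 1, True),
        Task("communications", 2, 2, 10, False),
    ]
    bus = tasks[1]
    owner = None  # task currently holding the bus lock, if any

    for tick in range(max_ticks):
        if bus.work == 0:
            return f"bus_management finished at tick {tick}"
        # Watchdog: the high-priority task has been pending for too long.
        if tick - bus.release >= watchdog_limit:
            return f"watchdog reset at tick {tick}"

        ready = [t for t in tasks if t.release <= tick and t.work > 0]
        # A task is blocked if it needs the lock and someone else holds it.
        blocked = [t for t in ready
                   if t.needs_lock and owner is not None and owner is not t]
        runnable = [t for t in ready if t not in blocked]
        if not runnable:
            continue

        def effective(t: Task) -> int:
            # With inheritance, the lock owner borrows the priority of the
            # most urgent task it is blocking.
            if inherit and t is owner and blocked:
                return max(t.priority, max(b.priority for b in blocked))
            return t.priority

        current = max(runnable, key=effective)
        if current.needs_lock:
            owner = current  # acquire (or keep) the lock
        current.work -= 1
        if current.work == 0 and owner is current:
            owner = None     # release the lock when the task completes

    return "no result within max_ticks"


print(simulate(inherit=False))
print(simulate(inherit=True))
```

With `inherit=False`, the communications task takes the CPU as soon as it is released, the data-collection task never gets to finish and release the lock, and the watchdog fires. With `inherit=True`, the data-collection task runs at the bus-management task's priority until it releases the lock, so the bus-management task completes before the communications task starts.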
Copyright © 1997 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Publishing Company, a unit of Reed Elsevier Inc.