7 tips for creating a reliable embedded system
Despite the hopes and dreams of many embedded engineers, reliable code doesn’t happen by accident. It is a painstaking process that requires developers to maintain and manage every bit and byte of the system. There is usually a sigh of relief when an application is validated “successfully” but just because the software is running correctly in that moment under controlled conditions doesn’t mean that it will tomorrow or a year from now.
There is a plethora of techniques for creating a reliable embedded system, ranging from a well-disciplined development cycle to strict implementation and system checking. An entire library could easily be filled with books on reliable software design. But seven tips are easy to implement and go a long way toward ensuring that a system performs more reliably and catches unexpected behavior.
Tip #1 – Fill ROM with Known Value
Software developers tend to be a very optimistic group, at least as far as their expectations of how faithfully their microcontroller will run their code over time. The thought of the microcontroller jumping out of the application space and executing in unintended code space seems like a fairly rare case; however, the opportunity for this to occur is nothing more than a buffer overflow or the dereferencing of a faulty pointer away. It can and DOES happen! The resulting behavior of the system would be undefined since memory could have all 0xFF’s in the space by default or, since the region of memory normally isn’t written, the values could have decayed into only God knows what.
There is a pretty neat linker or IDE trick, though, that can help identify and recover from just such an event. The trick is to use the FILL command to fill unused ROM with a known bit pattern. Many different patterns could be used, but if the intent is to build a more reliable system, the obvious choice is a pattern that sends the processor to an ISR fault handler, such as an opcode that jumps to the handler or triggers a fault. If something goes wrong and the processor starts to execute code outside of program space, the handler will run, providing the opportunity to store the state of the processor, registers and system before deciding on a corrective course of action.
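As a rough illustration of what the fill produces, the sketch below mimics it on a host-side copy of a ROM image: every byte past the end of the application is overwritten with a repeating two-byte pattern. The function name fill_unused_rom and the 0xDE 0xAD pattern are placeholders invented for this example. On a real part the linker does this work, and the pattern would be an opcode that traps or jumps to the fault handler.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder pattern (assumption). A real project would use an opcode
 * that traps or branches to the fault handler on the target CPU. */
#define FILL_PATTERN_HI 0xDEu
#define FILL_PATTERN_LO 0xADu

/* Overwrite image[used_len .. image_len) with the repeating two-byte
 * pattern, leaving the application bytes below used_len untouched. */
void fill_unused_rom(uint8_t *image, size_t image_len, size_t used_len)
{
    for (size_t i = used_len; i < image_len; i++) {
        size_t offset = i - used_len;
        image[i] = (offset % 2u == 0u) ? FILL_PATTERN_HI : FILL_PATTERN_LO;
    }
}
```

Keeping the pattern in one place makes it easy to swap in the correct opcode once the target architecture is known.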
Additional information on how to use FILL and alternative strategies for its use can be found in “Improving Code Integrity Using FILL”.
Tip #2 – Check Application CRC
One of the great benefits available to embedded engineers is that our IDEs and toolchains can automatically generate an application or memory space checksum against which the application can be verified. The interesting thing is that in many cases the checksum is used only when program code is loaded onto a device.
If a CRC or checksum is kept in memory, though, then verifying that the application is still intact at start-up (or even periodically for long-running systems) is a great way to catch corruption before it causes unexpected behavior. The chance that a programmed application will change is small, but considering the billions of microcontrollers shipped each year and the harsh environments some of them operate in, the chance of a corrupted application is not zero. Even more likely is that a bug in the system causes a flash write or flash erase in a sector, resulting in a corrupted application.
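A minimal sketch of such a start-up check follows, assuming the toolchain stores a standard CRC-32 (reflected polynomial 0xEDB88320) for the image. The function names, and the choice to pass the stored value in as a parameter, are assumptions made for illustration. Many microcontrollers also provide a hardware CRC unit that could replace the bitwise loop.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320, initial value and
 * final XOR of 0xFFFFFFFF). Slow but small; no lookup table needed. */
uint32_t crc32_compute(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Returns true when the CRC computed over the image matches the stored
 * value. The image range must not include the stored CRC itself. */
bool app_image_is_intact(const uint8_t *image, size_t len, uint32_t stored_crc)
{
    return crc32_compute(image, len) == stored_crc;
}
```

At start-up the application would call app_image_is_intact() with the image’s start address, length and stored CRC, and drop into a safe state if it returns false.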
Tip #3 – Perform a RAM Check on Start-up
In order to build a more reliable and robust system it is important to ensure that the system hardware is functioning. After all, hardware does fail. (Thankfully software never fails; it just does what it was coded to do, whether right or wrong.) Verifying that there are no issues with internal or external RAM on start-up is a great way to ensure that the hardware is functioning as expected.
There are many different methods that can be used to perform a RAM check, but a common one is to write a known pattern, let it sit for a short period, and then read it back. What is read should match what was written. The truth is that in most cases the RAM check will pass, which is what we want. But on the off chance that it doesn’t, the check gives the system an excellent opportunity to flag a hardware issue.
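The write-then-read-back idea can be sketched as below. This is a simplified illustration, not the memtest module mentioned in this tip: the function name and the two alternating patterns are assumptions, the test destroys the contents of the region, and the “sit for a short period” delay is left out (the whole region is written before any of it is read back).

```c
#include <stddef.h>
#include <stdint.h>

/* Destructive pattern test over num_words 32-bit words starting at base.
 * Returns NULL on success, or the address of the first word that did not
 * read back as written. The two patterns together toggle every bit. */
volatile uint32_t *ram_pattern_test(volatile uint32_t *base, size_t num_words)
{
    static const uint32_t patterns[2] = { 0xAAAAAAAAu, 0x55555555u };

    for (size_t p = 0; p < 2; p++) {
        /* Write the pattern to the whole region first... */
        for (size_t i = 0; i < num_words; i++) {
            base[i] = patterns[p];
        }
        /* ...then read it back and compare. */
        for (size_t i = 0; i < num_words; i++) {
            if (base[i] != patterns[p]) {
                return &base[i];
            }
        }
    }
    return NULL;
}
```

Returning the failing address rather than a plain pass/fail flag gives the start-up code something concrete to log.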
There is a memtest C module, written back in 2000 by Michael Barr, that will save an engineer time when considering a RAM test. The module is available for download from embedded.com.
Tip #4 – Use a Stack Monitor
To a large number of embedded developers the stack seems to be quite the mystical force. When strange things start to happen and the engineer is finally stumped they begin to think, well maybe something is going on with the stack. The result is blind tweaking and adjustments of the stack size, position, etc. Often enough the bug has nothing to do with the stack but how can one really be sure? After all, how many engineers actually perform a worst-case stack size analysis?
The size of the stack is allocated statically at compile time, but the stack is used in a dynamic way. As code is executed, variables, return addresses, and other information that the application needs are stored on the stack. This activity causes the stack to grow within its allocated memory. However, this growth can sometimes exceed the compile-time size limit, causing the stack to corrupt whatever lies in the memory region next door. One way to gain confidence that the stack is behaving is to implement a stack monitor as part of the system’s health and wellness code (How many engineers do this?). The stack monitor creates a buffer zone between the stack and the “other” memory region, filled with a known bit pattern. The monitor then constantly watches the pattern for any changes. If the bit pattern changes then the stack has grown too far and is on the verge of plunging the system into a dark abyss! The monitor can then log the occurrence, system states and any other useful data that can later be used to diagnose the issue.
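A bare-bones version of such a monitor might look like the sketch below. The function names and the 0xC5 pattern are assumptions for illustration. On a real target the guard buffer would be placed by the linker directly beyond the end of the stack, in the direction the stack grows, and the check would run periodically from a timer or idle task.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary guard byte (assumption). Any value that is unlikely to be
 * written by normal stack activity will do. */
#define STACK_GUARD_PATTERN 0xC5u

/* Fill the guard zone with the known pattern once, at start-up. */
void stack_guard_init(volatile uint8_t *guard, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        guard[i] = STACK_GUARD_PATTERN;
    }
}

/* Returns true while every guard byte still holds the pattern.
 * A false result means something, most likely the stack, wrote into
 * the guard zone. */
bool stack_guard_intact(const volatile uint8_t *guard, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (guard[i] != STACK_GUARD_PATTERN) {
            return false;
        }
    }
    return true;
}
```

When stack_guard_intact() returns false, the application can log the event and move to a safe state before the neighboring memory region is damaged. A single large stack allocation can skip over a small guard zone without touching it, so a larger zone catches more cases.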
A stack monitor is often available in RTOSes and in microcontroller systems that implement a memory protection unit (MPU). The scary part is that these capabilities are usually either off by default or can be turned off by the developer, which happens all too often. A quick search of the internet reveals recommendations to turn off the stack monitor in an RTOS to save 56 bytes of flash space. Take a moment and reflect on the implications!