
Programs contain two major paths: a forward path that does the work and a reverse path that rolls back the work when the program detects errors. Typically, these paths are so tightly bound together that both paths are difficult to read. Code that is difficult to read results in code that is difficult to write, debug, enhance, and reuse.
For example, in object-oriented programming, you cannot reuse objects as much as you might want, primarily because the objects are tightly bound together at the error-handling level. Many times, error code even gives clues about how a program implements an object.
The solution is to handle errors in programs as you would in a database-transaction recovery mechanism. A database transaction is a unit of work that involves one or more operations on a database. For example, the operation of inserting data into a database could be a transaction if it's the only operation performed. If you combine the insertion with an update, the program considers both operations as one transaction. In a database transaction, the transaction either executes in its entirety, or, if an error appears in any of its operations, the transaction totally cancels as if it had never executed. If an error appears, the program automatically rolls back all work to the beginning of the transaction.
When a development team first introduces transaction error handling to a project, many engineers resist it because it requires the removal of IF statements after calls to functions. Engineers also believe the technique makes debugging more difficult. However, after seeing how much easier it is to read and write transaction error-handling code, the resistance fades. In addition, transaction error handling decreases debugging time to a little less than that of the traditional method. The reason for this decrease is probably that transaction error handling has less embedded error-handling code, causing defects to stand out more. Also, when you add error-handling code, you do so in the structured way that most engineers like to work, a method that disturbs very little of an already-debugged program.
Software developers are often dismayed at how difficult commercial programs are to maintain and design, compared with programs they developed in school. The reason for this may be that the programs students develop in school are "toys" that assume perfect inputs and that the hardware has unlimited memory and disk space. In addition, most software engineers have very little formal training in error-handling methods. Typically, software developers learn error handling by example or by trial and error, and they use the traditional error-handling model: Check for an error, find an error, and return an error code.
Many formal design processes, such as structured analysis and structured design, recommend that developers ignore errors during design because such errors are an implementation detail. However, this "minor" detail can take up to one-third of the code in commercial programs. This code appears not just around algorithms but directly in the middle of the algorithms. The resulting programs are difficult to read, debug, and reuse.
Exception handling, or error handling, complements existing design methodologies such as structured analysis and structured design. This programming style separates most of the error-handling processes from the main algorithms. Error handling comprises four main parts: detection, correction, recovery, and reporting. The main focus of this article is error recovery.
In this context, the term "error" does not refer to a defect but to an exception that an algorithm cannot handle. A "defect," on the other hand, is an error that strains the design limits of an entire application or a system. For example, many algorithms assume that there is unlimited memory. Insufficient memory for the algorithm to complete successfully constitutes an error, and you must design the whole application to handle these out-of-memory errors. A defect occurs when an application that does not handle these errors causes a program to halt or to behave in an undocumented way. In other words, whether something is an error or a defect depends on what level of the software hierarchy you are observing.
Mixed forward and reverse path problem
Fig 1 shows the two major paths in commercial programs. The forward path does the work for which a program is designed. The reverse path is the error-handling code that keeps the forward path working correctly. It does this by detecting and fixing problems and rolling back partially completed work to a point at which the algorithm can again continue forward.
An intermediate function in a program has to stop what it is doing in the middle of the algorithm because the program called a function that cannot complete its task. This can lead to "tramp errors," a term similar to the "tramp-data" term of structured analysis and structured design (Ref 1).
Tramp errors in functions do not directly relate to the current function but are the result of a real error occurring in a lower-level function. For example, function A() calls function B(). Function B() needs some memory, so it calls the malloc() memory-allocation function. The malloc() function returns an out-of-memory error. This is a real error for the malloc() function. Function B() does not know how to get more memory, so it has to stop and pass the error back to function A(). From the perspective of function B() and probably function A(), an out-of-memory error is a tramp error.
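The A()/B()/malloc() chain just described can be sketched in C as follows. This is only an illustration of the traditional style; the error codes OK and ERR_NOMEM and the function bodies are assumptions made for the example, not code from the article's listings.

```c
#include <stdlib.h>

enum { OK = 0, ERR_NOMEM = 1 };   /* hypothetical error codes */

/* B() needs memory. The real error originates in malloc(). */
int B(size_t n, char **out)
{
    char *p = malloc(n);
    if (p == NULL)
        return ERR_NOMEM;         /* B() cannot get more memory, so it passes the error up */
    *out = p;
    return OK;
}

/* A() has nothing to do with memory, yet it must test for B()'s failure. */
int A(size_t n)
{
    char *buf;
    int rc = B(n, &buf);
    if (rc != OK)
        return rc;                /* tramp error: A() only passes it along */
    buf[0] = '\0';                /* the real work of A() would go here */
    free(buf);
    return OK;
}
```

Note that A() now has to know that B() can run out of memory, which is exactly the coupling the text describes.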
Tramp errors prevent functions from becoming "black boxes." For example, function A() (above) now knows something about how function B() works. In other words, tramp errors form a part of error recovery, not error detection, because if the program could immediately correct real errors, tramp errors would not occur. Because of tramp errors, almost every function has to handle errors that lower-level functions generate. This buck-passing can cause tight data coupling, which makes code reuse more difficult.
Unreadable code and poor reuse
Mixed forward and reverse paths and tramp errors combine to obscure the main forward path of the program, which is doing the real work. The correction and recovery parts of error handling are the main areas that obscure the code. Most of the code for detection and reporting can be in separate functions, so these components of error handling play less of a role in obscuring code than do the other two.
You can solve the problems of unreadable code and poor reuse by separating the forward and reverse error-processing paths and by using context-independent error codes. This method of error handling is very similar to the way databases handle error recovery. Transactions control the rollback process when a group of database operations cannot complete successfully.
The traditional defensive way of programming is to assume that a function may have failed to complete its task, resulting in a lot of error-handling code to check for the errors and to roll back partially completed work, as Fig 1 shows. Now, assume the reverse: that returning functions have successfully completed their tasks. In this scenario, if the function or one of the functions it calls has errors, the function passes processing control to a programmer-defined recovery point. In other words, the programmer defines transaction points so that if there are any problems, the work rolls back to those points, and the processing again proceeds. With this approach, you do not need to check for errors after each function call, and tramp error-detection code does not clutter the forward path.
Context-independent error codes provide more information than just an error number. They also provide information such as which function generated an error, the state that caused the error, the recommended correction, and the error severity. This information allows the program to correct the error in a location separate from the forward processing path.
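One possible shape for such an error record is sketched below. The type names, field names, and the functions error_record(), error_last(), and error_format() are hypothetical and chosen only for illustration; the article's own listings may organize this information differently.

```c
#include <stdio.h>

/* Hypothetical severity levels. */
typedef enum { SEV_WARNING, SEV_RECOVERABLE, SEV_FATAL } Severity;

/* Everything a remote recovery point needs to interpret the error. */
typedef struct {
    int         code;        /* error number                    */
    const char *function;    /* function that generated it      */
    const char *state;       /* what the function was doing     */
    const char *correction;  /* recommended correction          */
    Severity    severity;    /* how serious the error is        */
} ErrorInfo;

static ErrorInfo last_error = { 0, "", "", "", SEV_WARNING };

/* Called at the point of detection, before control leaves the function. */
void error_record(int code, const char *function, const char *state,
                  const char *correction, Severity severity)
{
    last_error.code       = code;
    last_error.function   = function;
    last_error.state      = state;
    last_error.correction = correction;
    last_error.severity   = severity;
}

/* Called at the recovery point, far from the forward path. */
const ErrorInfo *error_last(void)
{
    return &last_error;
}

/* Builds a report line; returns the length snprintf() computed. */
int error_format(char *buf, size_t len)
{
    return snprintf(buf, len,
                    "%s: error %d while %s (severity %d); suggestion: %s",
                    last_error.function, last_error.code, last_error.state,
                    (int)last_error.severity, last_error.correction);
}
```

Because the record carries its own context, the code that reads it does not need to sit directly after the failing call.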
Programmers usually encode contexts of errors for error-reporting functions. For example, error contexts may include the names of the program, the function, the error type, and the error code. The program saves these parameters to report later. However, programmers rarely use sophisticated encoding schemes because traditional error handling already knows the context of the error: Checking occurs right after a call to the offending function.
With transaction error handling, the recovery process is separate from the forward processing path, necessitating the use of context-independent error codes. This may involve creating unique error codes across a whole application or system (with the codes bound at compile time). An alternative would be to assign code ranges or other unique identifiers to functions at runtime.
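As a rough sketch of the compile-time option, each module could own a fixed block of codes so that a code identifies its source wherever it is examined. The module identifiers, the block size of 1000, and the macro and function names below are all assumptions made for this example.

```c
/* Hypothetical module identifiers; each module owns a block of 1000 codes. */
enum { MOD_FILE = 1, MOD_MEMORY = 2, MOD_PARSER = 3 };

#define ERR_RANGE 1000
#define MAKE_ERR(module, local) ((module) * ERR_RANGE + (local))

/* Application-wide unique codes, bound at compile time. */
enum {
    ERR_FILE_NOT_FOUND = MAKE_ERR(MOD_FILE, 1),
    ERR_FILE_READ      = MAKE_ERR(MOD_FILE, 2),
    ERR_MEM_EXHAUSTED  = MAKE_ERR(MOD_MEMORY, 1)
};

/* Recover the originating module from a code. */
int err_module(int code)
{
    return code / ERR_RANGE;
}

/* Recover the module-local error number from a code. */
int err_local(int code)
{
    return code % ERR_RANGE;
}
```

A recovery point that receives ERR_FILE_READ can then tell that the file module raised it without knowing which function was called last.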
Code readability and reuse
The transaction error-handling approach makes programs easier to read because the reverse-processing paths are visually separate from the forward-processing paths. This method makes it possible to create general error-recovery interfaces so that functions (modules or objects) connect only loosely at the error-handling level. This loose connection is possible because there are fewer tramp errors to control the recovery process, and the program needs to handle only the real errors.
Two methods you need for building a transaction error-handling library are transaction-control and transaction-data management. Transaction-control management requires some language support to implement the mechanism that controls error recovery. For example, languages like Hewlett-Packard Co's Pascal-MODCAL have a "try/recover" feature that can support a transaction error-handling style.
For other languages, you must use a "global-goto," or "multithreaded," feature, which allows a lower-level function and all other functions above it to exit to a point you define in a higher-level function without passing error-code flags through all the other functions. In C, you do this with the setjmp and longjmp library routines. The setjmp function saves the calling environment in a jmp_buf, and longjmp restores that environment, so that execution resumes at the setjmp call. The listings, which are written in C, show how these functions work.
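The following is a minimal sketch of that mechanism, separate from the article's listings. The function names, the single global recovery point, and the error value 42 are assumptions for illustration only.

```c
#include <setjmp.h>

static jmp_buf recovery_point;   /* hypothetical single recovery point   */
static int     last_error;       /* set just before the jump             */
static int     steps_done;       /* counts completed forward-path steps  */

/* Lowest level: detects the real error and jumps straight to recovery. */
void low_level(int fail)
{
    if (fail) {
        last_error = 42;              /* hypothetical error code */
        longjmp(recovery_point, 1);
    }
    steps_done++;
}

/* Intermediate level: no error checks, so no tramp errors. */
void mid_level(int fail)
{
    low_level(0);
    low_level(fail);
    low_level(0);
}

/* Top level: defines the recovery point. Returns 0 or the error code. */
int run_transaction(int fail)
{
    steps_done = 0;
    last_error = 0;
    if (setjmp(recovery_point) != 0) {
        /* Reverse path: reached only through longjmp(). */
        return last_error;
    }
    mid_level(fail);                  /* forward path */
    return 0;
}

int steps_completed(void)
{
    return steps_done;
}
```

The error code travels in a static variable rather than through the value of setjmp, because the C standard allows setjmp to appear only in a few simple contexts such as the controlling expression of an if statement.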
The material in Ref 2 details the new C++ exception-handling feature, which provides an excellent foundation for a transaction-based error handler. The material in Ref 3 also describes how to add C++ error-handling functions to regular C programs. However, overuse of transaction error handling can lead to code that is just as cluttered as the traditional error-handling style. You must design transaction boundaries for objects with the same care that you would use to design an object's interface.
If a language is missing a global-goto or multithreaded feature, use macros or other "wrapper" functions to build recovery processes that are mostly invisible. Wrapper functions and macros add functionality to functions that you cannot change, such as library functions.
In building a transaction-handling package, you might want to give it the following features:
Transaction-data management
Recovery involves more than just rolling back functions to undo some intermediate work. It may also involve releasing unneeded memory or changing global variables back to the values they had at the beginning of the transaction.
You can best manage memory using a mechanism similar to the mark/release memory feature in some implementations of Pascal. The mark/release procedures allow dynamic allocation and deallocation of memory in an executing Pascal program. The C functions malloc() and free(), along with a stack of pointers to track the allocated memory, provide the best features for allocating and freeing memory. With these features, you can call a mark function just before the program's transaction starting point to mark the current stack point. If a longjmp() goes to this recovery point, the release function is called to free any memory allocated after the mark point.
To remove pointers from the mark/release stack, you need a commit function at the end of a program transaction. The commit function indicates the successful completion of a transaction in the database context. You must also consider nested transactions, however. A simple solution would be to have each transaction keep its own mark/release stack.
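A mark/release/commit scheme along these lines might look like the sketch below. The function names, the fixed-size tracking stack, and the decision that commit hands ownership of the blocks to the caller are assumptions made for this example, not the article's implementation.

```c
#include <stdlib.h>

#define TRACK_MAX 64

static void  *tracked[TRACK_MAX];   /* stack of pointers allocated so far */
static size_t tracked_top;

/* Mark: remember the current stack position at the start of a transaction. */
size_t mem_mark(void)
{
    return tracked_top;
}

/* Allocate through malloc() and record the pointer for a possible rollback. */
void *mem_alloc(size_t n)
{
    void *p;
    if (tracked_top == TRACK_MAX)
        return NULL;                 /* tracking stack is full */
    p = malloc(n);
    if (p != NULL)
        tracked[tracked_top++] = p;
    return p;
}

/* Release: on rollback, free everything allocated after the mark. */
void mem_release(size_t mark)
{
    while (tracked_top > mark)
        free(tracked[--tracked_top]);
}

/* Commit: the transaction succeeded, so stop tracking its blocks.
   In this sketch the caller now owns them and must free them itself. */
void mem_commit(size_t mark)
{
    if (tracked_top > mark)
        tracked_top = mark;
}

size_t mem_tracked(void)
{
    return tracked_top;
}
```

Because a mark is just a position in the stack, an inner transaction can take its own mark and release back to it without disturbing blocks that belong to the outer transaction.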
You can roll back global and other static variables with a strategy similar to the one in the memory-management problem. Just before a transaction's beginning point, the program saves on a stack the states of all the globals that might change. This strategy allows you to nest transactions.
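The same idea can be sketched for variables. The example below handles only int variables and uses hypothetical function names; a fuller version would need to record each variable's size or type.

```c
#include <stddef.h>

#define SAVE_MAX 32

/* One saved variable: where it lives and the value it had. */
typedef struct {
    int *addr;
    int  saved;
} SavedInt;

static SavedInt save_stack[SAVE_MAX];
static size_t   save_top;

/* Mark the start of a transaction. */
size_t globals_mark(void)
{
    return save_top;
}

/* Save a variable before the transaction changes it. Returns 0 on success. */
int globals_save(int *addr)
{
    if (save_top == SAVE_MAX)
        return -1;                   /* save stack is full */
    save_stack[save_top].addr  = addr;
    save_stack[save_top].saved = *addr;
    save_top++;
    return 0;
}

/* Roll back: restore values in reverse order down to the mark. */
void globals_rollback(size_t mark)
{
    while (save_top > mark) {
        save_top--;
        *save_stack[save_top].addr = save_stack[save_top].saved;
    }
}

/* Commit: keep the new values and discard the saved ones. */
void globals_commit(size_t mark)
{
    if (save_top > mark)
        save_top = mark;
}
```

Restoring in reverse order matters when the same variable is saved more than once, as happens with nested transactions: the innermost value is restored first and the outermost last.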
Context-independent error codes
The traditional error-handling style of checking error codes after each function call automatically gives errors a context. The transaction error-handling style provides this context information in another way. The biggest challenge in transaction error handling is that error codes alone are not very useful. For example, "97" could be the letter "a" in ASCII code, the digits "6" and "1" in BCD format, index 97 in a message array, the 97th error, an out-of-memory error, a disk-full error, a divide-by-zero error, or another error.
To decode an error code, a program must know the source of the error. Some information that the program may save when an error occurs includes the machine name, the program name, the process number, the module name, the function name, and, of course, the error code. The program needs to send this information only when it must roll back a transaction.
The amount of information that the program saves depends on the location of the transaction-recovery point and on the runtime environment. For example, a client-server application may need more information than does a simple PC application. Each recovery point can usually find higher-level context information fairly easily. For example, the names of the machine, program, module, and function can easily pass down to a lower-level recovery point. However, a program cannot collect lower-level context information because, by the time the program gets down to that level, the function that had the error would no longer be active.
You may want to consider the following points when implementing a transaction error-handling scheme:
Traditional error-handling style
Listing 1, which reads a binary formatted file, is coded with a common error-handling style. The code would have been more cluttered without the aExitErr() and aRetErr() macros to manage the error reporting and recovery. Listing 1 uses the simple error-recovery process: Detect error, report error, and exit. However, notice how much error-handling code is mixed in with the algorithm.
Listings 2 through 5 show an implementation of the transaction error-handling style. Listing 2 performs the same function as the program in Listing 1 but uses the transaction style of error handling. The functions erSet, erUnset, and erRollBack provide the error handling defined in the include file erpub.h. Listings 3 through 5 show the support functions for the transaction error-handling method. In the main body of the algorithm, the code following the recovery sections is clearer than that in the traditional error-handling example, and there is no error-handling or recovery code mixed in with the algorithm.
However, there are some shortcomings in the support modules. For example, most of the macros should be functions, and the program should save the vEnv values in a linked list. Some engineers point out that the transaction implementation of read.c is not really shorter than the traditional implementation of read.c because the error-handling code simply moves from read.c into the support functions. But that is exactly the goal: to remove the error-handling code from most functions and encapsulate the error-handling in common shared code.
The include file epub.h contains wrapper macros that cause the program to call the appropriate transaction error-handling functions instead of the standard library function. For example, when invoking the standard function fclose, the program actually calls the function eClose.
Listing 3 defines macros and global data structures that form a crude error-transaction manager. The macros perform the following operations:
Listing 4 contains wrapper macros that cause the program to call the functions in the file e.c in place of the standard library functions. The functions in e.c behave the same as the standard library functions, but if the error transaction manager is on (erRecOn is true in erpub.h), control passes to the last defined rollback point, rather than just returning the same error code as the associated standard library function.
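A wrapper of this kind might be sketched as follows. The name erRecOn is borrowed from the text, but its definition here, the rollback_env variable, and the functions eOpen() and try_open() are guesses made for illustration and are not taken from the article's listings.

```c
#include <setjmp.h>
#include <stdio.h>

static jmp_buf rollback_env;   /* hypothetical: the last defined rollback point */
static int     erRecOn;        /* nonzero while the transaction manager is on   */

/* Hypothetical wrapper around fopen(). With erRecOn off it behaves like
   fopen(); with erRecOn on, a failure jumps to the rollback point. */
FILE *eOpen(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);
    if (fp == NULL && erRecOn)
        longjmp(rollback_env, 1);
    return fp;
}

/* Forward path with no error check after the call.
   Returns 1 if the file opened, 0 if the transaction rolled back. */
int try_open(const char *path)
{
    FILE *fp;

    erRecOn = 1;
    if (setjmp(rollback_env) != 0) {
        erRecOn = 0;               /* recovery section */
        return 0;
    }
    fp = eOpen(path, "rb");        /* cannot return NULL while erRecOn is set */
    erRecOn = 0;
    fclose(fp);
    return 1;
}
```

The sketch calls the wrapper by its own name. Substituting it for the library function through a macro, as the text describes, hides the call from the reader, which is the trade-off between convenience and visibility.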
Using these wrapper macros makes it easier to add transaction error handling to old programs, but if you want to make the error-handling process more visible, have the program call the functions in e.c directly instead of the standard library functions. The file in Listing 4 is also a good place to define context-independent error codes.
The file in Listing 5 contains the implementations of the wrapper macros in epub.h. Listing 5 shows only two of the functions. These functions behave exactly like the standard library functions with the same name because they call the standard library functions. For more flexibility, a real error transaction manager might allow you to define the error codes that determine whether a rollback occurs.
So far, only small programs and enhancements of existing programs have used the transaction error-handling technique. But its features (greater code reuse, greater code supportability, and better quality code) may make it more widespread. Just as you can separate the functional part of algorithms from user interfaces (client/server models), you can also separate error handling from the functional algorithm.
Bruce A Rafnel is a software engineer at the Professional Services Division of Hewlett-Packard Co, Mountain View, CA, where he has worked for 12 years. In his current position, he develops Standard Generalized Markup Language (SGML) applications that help deliver HP's customer-training courses. Rafnel also helped develop the Charting Gallery graphics package for PCs and internal SGML applications, and he helped enhance the VPlus forms manager for HP3000 Unix systems. He earned a BS in computer science at California Polytechnic State University at San Luis Obispo. A member of the IEEE and the C Users Group, Rafnel lists voice-controlled home automation as one of his spare-time interests.
Thanks to Andra Marynowski and Kevin Wentzel, coworkers at Hewlett-Packard, who helped review and refine the ideas in this article, and to King Wah Moberg for providing a number of reviews.