An Exploration of Models for Software Faults and Errors: a Journey through Field Data and Injection Experiments
Doctoral thesis, 1998
Society is becoming quite dependent on computer-based systems. Today, computers are embedded in wristwatches, vending machines, factory equipment, automobiles and aircraft. A failure in a computer-based system that controls critical applications may lead to significant economic losses or even the loss of human lives. The causes of failures in computer-based systems are manifold: physical faults, maintenance errors, design and implementation mistakes resulting in hardware or software defects, and user or operator mistakes. These causes - faults - are all undesired circumstances that hinder the system from delivering the expected service. There are two complementary ways to ensure that a system delivers the expected service: fault prevention, i.e. avoiding the introduction of faults; and fault tolerance, i.e. ensuring that the system delivers its service despite the presence of faults. This thesis addresses fault tolerance. A fault-tolerant system should tolerate both hardware and software faults, as both categories can have a great impact on it. Furthermore, confidence in a fault-tolerant system's ability to tolerate faults must be established before the system is deployed in critical applications.
One attractive approach to establishing confidence in a fault-tolerant system's capability is fault injection. Fault injection can be used for studying the effects of hardware and software faults. However, in both the academic community and industry, most fault injection studies have targeted the effects of physical hardware faults. Only a few studies have been concerned with software faults, because knowledge of software faults experienced by systems in the field is limited. As a result, it is difficult to define realistic fault sets to inject. Realistic fault sets are crucial if a fault injection experiment is intended to quantify a system's fault tolerance. Consequently, more research is needed in the fault injection area - especially studies targeting software faults and the errors induced by them.
This thesis contributes towards fulfilling this need by investigating models of software faults and models of errors induced by software faults. Techniques for emulating representative software faults were also implemented and evaluated. Specifically, an investigation of software faults experienced by a large IBM operating system product was carried out. In addition, general procedures that allow injection experiments to be based on field data were devised and put into practice to emulate software faults in an embedded real-time system.
The novelty of this thesis lies in the use of general procedures to construct models of software faults and of software-induced errors - on the basis of field data - for use in injection experiments. No such research has been carried out previously. The combined use of the procedures and the injection techniques is the major contribution of this thesis. These procedures and techniques are mainly of use in validation efforts aimed at predicting a fault-tolerant system's ability to detect and process the effects of software faults.