Assessment and Comparison of Physical Fault Injection Techniques
Doctoral thesis, 1999
This thesis deals with the problem of validating and estimating the effectiveness of error handling mechanisms in computer systems. The main contribution is an assessment of the effectiveness and usefulness of several physical fault injection techniques. The assessment is based on fault injection experiments conducted on the fault-tolerant, distributed, real-time system MARS and the Thor microprocessor. Another key contribution is the validation of the error handling mechanisms included in these systems.
The MARS system was evaluated using heavy-ion radiation, electromagnetic interference, and pin-level fault injection to allow, for the first time, a direct comparison of physical fault injection techniques. Significant differences were observed in the results obtained with the three techniques. The results also showed that hardware-based error detection mechanisms are the most effective mechanisms of MARS, but that application-level mechanisms can significantly improve the error detection coverage.
The thesis introduces scan-chain implemented fault injection (SCIFI), which provides higher observability and controllability than most other physical fault injection techniques. The SCIFI technique injects faults via the test access port of integrated circuits. Results of SCIFI experiments on the Thor microprocessor are compared with results of simulation-based fault injection performed using a highly detailed VHDL model of Thor. The comparison shows that the SCIFI technique can be more than 100 times faster than simulation-based fault injection, and yet produce similar results.
Additional SCIFI experiments on Thor show that the estimated error coverage may vary by more than five percentage points for different workload input sequences. A methodology is presented for predicting the error coverage for various input sequences based on fault injection experiments with a specific input sequence. Although the accuracy of the predicted values is limited, the methodology is able to find input sequences with high, medium or low error coverage.
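As a rough illustration of what comparing estimated error coverage across input sequences involves, the sketch below computes a coverage estimate with a normal-approximation confidence interval from fault injection counts. It is not the thesis's methodology: the function name `coverage_estimate`, the definition of coverage as detected errors divided by effective errors, and all counts are assumptions made up for this example.

```python
import math


def coverage_estimate(detected, effective, z=1.96):
    """Estimate error detection coverage from fault injection counts.

    detected:  number of injected faults whose errors were detected
    effective: number of injected faults that produced an error at all
    z:         normal quantile (1.96 gives an approximate 95% interval)

    Returns (point_estimate, lower_bound, upper_bound).
    """
    if effective <= 0:
        raise ValueError("effective must be positive")
    if not 0 <= detected <= effective:
        raise ValueError("detected must lie between 0 and effective")
    c = detected / effective
    # Normal approximation to the binomial proportion interval.
    half_width = z * math.sqrt(c * (1.0 - c) / effective)
    return c, max(0.0, c - half_width), min(1.0, c + half_width)


# Hypothetical counts for two workload input sequences (not thesis data).
runs = {"input_a": (900, 1000), "input_b": (960, 1000)}
estimates = {name: coverage_estimate(d, n) for name, (d, n) in runs.items()}

# Difference between the two point estimates, in percentage points.
diff_points = 100.0 * (estimates["input_b"][0] - estimates["input_a"][0])

for name, (c, lo, hi) in estimates.items():
    print(f"{name}: coverage {c:.3f} (approx. 95% CI {lo:.3f} to {hi:.3f})")
print(f"difference: {diff_points:.1f} percentage points")
```

With counts like these, a gap of several percentage points between input sequences is larger than the interval widths, which is the kind of variation that makes the choice of workload input matter when reporting coverage.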
fault tolerance
concurrent error detection
dependability
coverage
fault injection
experimental validation
boundary scan