On Efficient Measurement of the Impact of Hardware Errors in Computer Systems
Technology and voltage scaling are making integrated circuits increasingly susceptible to failures caused by soft errors. Soft errors arise from temporary hardware faults that alter data and signals in digital circuits. They are predominantly caused by ionizing particles, electrical noise, and wear-out effects, but may also occur as a result of marginal circuit designs and manufacturing process variations.
Modern computers are equipped with a range of hardware- and software-based mechanisms for detecting and correcting soft errors, as well as other types of hardware errors. While these mechanisms can handle a variety of errors and error types, protecting a computer completely from the effects of soft errors is technically and economically infeasible. Hence, in applications where reliability and data integrity are of primary concern, it is desirable to assess and measure the system's ability to detect and correct soft errors.
This thesis is devoted to the problem of measuring hardware error sensitivity of computer systems. We define hardware error sensitivity as the probability that a hardware error results in an undetected erroneous output. Since the complexity of computer systems makes it extremely demanding to assess the effectiveness of error handling mechanisms analytically, error sensitivity and related measures, e.g., error coverage, are in practice determined experimentally by means of fault injection experiments.
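As an illustration of the definition above, the following minimal sketch estimates error sensitivity from the outcomes of a fault injection campaign as the fraction of injected errors that lead to an undetected erroneous output. The outcome labels, the function name, and the use of a normal-approximation confidence interval are assumptions made for this example, not the procedure used in the thesis.

```python
import math
from collections import Counter

# Hypothetical outcome labels for a single fault injection experiment.
NO_EFFECT = "no_effect"
DETECTED = "detected"
UNDETECTED = "undetected_erroneous_output"


def estimate_error_sensitivity(outcomes, z=1.96):
    """Return (estimate, lower, upper) for the error sensitivity.

    The estimate is the share of experiments classified as UNDETECTED.
    The bounds use the normal approximation to the binomial distribution
    (an assumption; it is inaccurate for very small samples or for
    proportions close to 0 or 1).
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("at least one experiment is required")
    counts = Counter(outcomes)
    p = counts[UNDETECTED] / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


if __name__ == "__main__":
    # Made-up campaign: 100 injections, 10 of which escape detection.
    campaign = [NO_EFFECT] * 60 + [DETECTED] * 30 + [UNDETECTED] * 10
    estimate, lower, upper = estimate_error_sensitivity(campaign)
    print(f"error sensitivity = {estimate:.3f} [{lower:.3f}, {upper:.3f}]")
```

The width of the interval shrinks only with the square root of the number of experiments, which is one reason why such campaigns tend to be time-consuming.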
The error sensitivity of a computer system depends not only on the design of its error handling mechanisms, but also on the program executed by the computer. In addition, measurements of error sensitivity are affected by the experimental setup, including how and where the errors are injected, and by the assumptions about how soft errors are manifested, i.e., the error model. This thesis identifies and investigates six parameters, or sources of variation, that affect measurements of error sensitivity. These parameters fall into two subgroups: those that concern system characteristics, namely (i) the input processed by a program, (ii) the program's source code implementation, and (iii) the level of compiler optimization; and those that concern the measurement setup, namely (iv) the number of bits that are targeted in each experiment, (v) the target location in which faults are injected, and (vi) the time of injection.
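The measurement-setup parameters (iv) to (vi) can be pictured as the coordinates of one point in an error space. The sketch below draws such a point uniformly at random and applies a bit-flip error model to a data word. The bit-flip model, the uniform sampling, the 32-bit word size, and all names are assumptions chosen for illustration; they are not taken from the thesis.

```python
import random


def flip_bits(value, bit_positions):
    """Return `value` with the bit at each listed position inverted."""
    for pos in bit_positions:
        value ^= 1 << pos
    return value


def sample_injection(rng, num_locations, num_time_steps,
                     word_bits=32, num_bits=1):
    """Draw one hypothetical fault injection experiment.

    location : index of the targeted register or memory word (parameter v)
    time     : index of the execution step at injection (parameter vi)
    bits     : distinct bit positions to flip in the target (parameter iv)
    """
    return {
        "location": rng.randrange(num_locations),
        "time": rng.randrange(num_time_steps),
        "bits": sorted(rng.sample(range(word_bits), num_bits)),
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that a campaign is repeatable
    experiment = sample_injection(rng, num_locations=16,
                                  num_time_steps=1000, num_bits=2)
    corrupted = flip_bits(0x0000FFFF, experiment["bits"])
    print(experiment, hex(corrupted))
```

With these assumptions, the number of distinct single-bit experiments is the product of the number of locations, time steps, and bits per word, which grows quickly even for short programs.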
To accurately measure the error sensitivity of a system, one needs to conduct several sets of fault injection experiments in which the different sources of variation are varied. As these experiments are quite time-consuming, it is desirable to improve the efficiency of fault injection-based measurement of error sensitivity. To this end, the thesis proposes and evaluates different error space optimization and error space pruning techniques that reduce the time and effort needed to measure error sensitivity.