On Efficient Measurement of the Impact of Hardware Errors in Computer Systems
Doctoral thesis, 2017

Technology and voltage scaling are making integrated circuits increasingly susceptible to failures caused by soft errors. Soft errors stem from temporary hardware faults that alter data and signals in digital circuits. They are predominantly caused by ionizing particles, electrical noise and wear-out effects, but may also occur as a result of marginal circuit designs and manufacturing process variations. Modern computers are equipped with a range of hardware- and software-based mechanisms for detecting and correcting soft errors, as well as other types of hardware errors. While these mechanisms can handle a variety of errors and error types, protecting a computer completely from the effects of soft errors is technically and economically infeasible. Hence, in applications where reliability and data integrity are of primary concern, it is desirable to assess and measure the system's ability to detect and correct soft errors. This thesis is devoted to the problem of measuring the hardware error sensitivity of computer systems. We define hardware error sensitivity as the probability that a hardware error results in an undetected erroneous output. Since the complexity of computer systems makes it extremely demanding to assess the effectiveness of error handling mechanisms analytically, error sensitivity and related measures, e.g., error coverage, are in practice determined experimentally by means of fault injection experiments. The error sensitivity of a computer system depends not only on the design of its error handling mechanisms, but also on the program executed by the computer. In addition, measurements of error sensitivity are affected by the experimental set-up, including how and where the errors are injected, and the assumptions about how soft errors are manifested, i.e., the error model. This thesis identifies and investigates six parameters, or sources of variation, that affect measurements of error sensitivity.
These parameters form two subgroups: those that deal with system characteristics, namely, (i) the input processed by a program, (ii) the program's source code implementation, and (iii) the level of compiler optimization; and those that deal with the measurement setup, namely, (iv) the number of bits that are targeted in each experiment, (v) the target location in which faults are injected, and (vi) the time of injection. To accurately measure the error sensitivity of a system, one needs to conduct several sets of fault injection experiments in which these sources of variation are varied. As these experiments are quite time-consuming, it is desirable to improve the efficiency of fault injection-based measurement of error sensitivity. To this end, the thesis proposes and evaluates different error space optimization and error space pruning techniques to reduce the time and effort needed to measure the error sensitivity.

fault injection

bit-flip errors

error sensitivity

soft errors

efficiency

Room EA, Rännvägen 6 (EDIT-building), Chalmers
Opponent: Professor Henrique Madeira, University of Coimbra, Portugal

Author

Behrooz Sangchoolie

Chalmers, Computer Science and Engineering (Chalmers), Computer Engineering (Chalmers)

On the Impact of Hardware Faults – An Investigation of the Relationship between Workload Inputs and Failure Mode Distributions

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 7612 (2012), p. 198-209

Paper in proceeding

A Study of the Impact of Bit-flip Errors on Programs Compiled with Different Optimization Levels

10th European Dependable Computing Conference, EDCC 2014; Newcastle upon Tyne; United Kingdom; 13 May 2014 through 16 May 2014 (2014), p. 146-157

Paper in proceeding

A Comparison of Inject-on-Read and Inject-on-Write in ISA-Level Fault Injection

11th European Dependable Computing Conference (2016), p. 178-189

Paper in proceeding

A Study of the Impact of Single Bit-Flip and Double Bit-Flip Errors on Program Execution

Computer Safety, Reliability, and Security. SAFECOMP, September 24-27 (2013)

Paper in proceeding

One Bit is (Not) Enough: An Empirical Study of the Impact of Single and Multiple Bit-Flip Errors

The 47th IEEE/IFIP International Conference on Dependable Systems and Networks (2017), p. 97-108

Paper in proceeding

Light-Weight Techniques for Improving the Controllability and Efficiency of ISA-Level Fault Injection Tools

Proceedings of IEEE Pacific Rim International Symposium on Dependable Computing, PRDC (2017), p. 68-77

Paper in proceeding

Computer systems are becoming increasingly sensitive to failures caused by different types of errors. One of the dominant types is the soft error, which is caused by a temporary hardware fault that alters data in a computer system.

Modern computers are equipped with a range of hardware- and software-based mechanisms for detecting and correcting soft errors, as well as other types of hardware errors. While these mechanisms can handle a variety of errors and error types, protecting a computer completely from the effects of soft errors is technically and economically infeasible. Hence, in applications where reliability and data integrity are of primary concern, it is desirable to assess and measure the system's ability to detect and correct soft errors. Examples include applications in the automotive, avionics, and nuclear power industries, where a failure could result in loss of life or damage to the environment.

This thesis is devoted to the problem of measuring error sensitivity of computer systems. We define error sensitivity as the probability that a soft error results in an erroneous system output. The complexity of computer systems makes it extremely demanding to assess the effectiveness of error handling mechanisms analytically. Therefore, error sensitivity is in practice determined experimentally by means of fault injection experiments. The basic approach of fault injection is to artificially insert errors into a system to enable an analysis of the system's behavior in the presence of errors.
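As a minimal sketch of this approach, the Python example below injects single bit-flips into the input data of a toy workload and estimates error sensitivity as the fraction of injections that produce an erroneous output. The 8-bit data model, the function names, and the use of Python's built-in `max` as the workload are illustrative assumptions; they are not the tools, targets, or programs used in the thesis.

```python
import random

def flip_bit(value, bit):
    """Return `value` with one bit inverted (a single bit-flip error)."""
    return value ^ (1 << bit)

def run_experiment(workload, data, location, bit):
    """Inject one bit-flip and report whether the output becomes erroneous."""
    golden = workload(data)                 # fault-free reference output
    faulty = list(data)                     # copy so the original stays intact
    faulty[location] = flip_bit(faulty[location], bit)
    return workload(faulty) != golden

def exhaustive_sensitivity(workload, data, width=8):
    """Error sensitivity over every (location, bit) pair in the data."""
    outcomes = [run_experiment(workload, data, loc, bit)
                for loc in range(len(data))
                for bit in range(width)]
    return sum(outcomes) / len(outcomes)

def sampled_sensitivity(workload, data, experiments, width=8, seed=0):
    """Estimate error sensitivity from randomly chosen injections."""
    rng = random.Random(seed)
    erroneous = sum(
        run_experiment(workload, data,
                       rng.randrange(len(data)), rng.randrange(width))
        for _ in range(experiments))
    return erroneous / experiments

# Hypothetical workload: the maximum of four 8-bit values.
data = [10, 200, 3, 45]
print(exhaustive_sensitivity(max, data))    # 0.25
print(sampled_sensitivity(max, data, 1000))
```

In this toy case only flips in the largest value change the result, so a quarter of all injections produce an erroneous output and the rest are masked. Real fault injection tools target registers and memory during program execution rather than a Python list, and they additionally distinguish errors that are detected by the system's error handling mechanisms.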

The error sensitivity of a computer system depends not only on the design of its error handling mechanisms, but also on the program executed by the computer. In addition, measurements of error sensitivity are affected by the experimental set-up, including how and where the errors are injected. This thesis identifies and investigates six parameters that affect measurements of error sensitivity. These parameters form two subgroups: those that deal with system characteristics, namely, (i) the input processed by a program, (ii) the program's source code implementation, and (iii) the level of compiler optimization; and those that deal with the measurement setup, namely, (iv) the number of errors that are introduced into the system in each experiment, (v) the location in which errors are injected, and (vi) the time of injection.
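The three measurement-setup parameters together determine how many distinct experiments are possible for one program and one input. As a rough illustration, the sketch below counts the injections under a simplified model that is an assumption made here, not taken from the thesis: every error flips a fixed number of bits within one machine word, and any word can be targeted at any time point. The function name and the example numbers are hypothetical.

```python
from math import comb

def error_space_size(locations, time_points, width, bits_per_error):
    """Count distinct (time, location, bit pattern) injections.

    Assumed model: each error flips `bits_per_error` bits inside a single
    `width`-bit word, chosen among `locations` words, at one of
    `time_points` points during execution.
    """
    patterns_per_word = comb(width, bits_per_error)
    return time_points * locations * patterns_per_word

# Hypothetical target: 32 registers of 32 bits, 1000 executed instructions.
print(error_space_size(32, 1000, 32, 1))   # 1024000 single bit-flips
print(error_space_size(32, 1000, 32, 2))   # 15872000 double bit-flips
```

Even for this small hypothetical target, exhaustive injection would require millions of runs, and the count is multiplied again for each additional input, implementation, and optimization level.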

To accurately measure the error sensitivity of a system, one needs to conduct several sets of fault injection experiments in which these parameters are varied. As these experiments are quite time-consuming, it is desirable to improve the efficiency of fault injection-based measurement of error sensitivity. To this end, the thesis proposes and evaluates different techniques that reduce the number of experiments needed to measure the error sensitivity of computer systems.

Subject Categories

Computer Engineering

ISBN

978-91-7597-557-3

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 4238

Publisher

Chalmers
