On Efficient Measurement of the Impact of Hardware Errors in Computer Systems
Doctoral thesis, 2017
fault injection
bit-flip errors
error sensitivity
soft errors
efficiency
Author
Behrooz Sangchoolie
Chalmers, Computer Science and Engineering (Chalmers), Computer Engineering (Chalmers)
On the Impact of Hardware Faults – An Investigation of the Relationship between Workload Inputs and Failure Mode Distributions
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 7612 (2012), p. 198-209
Paper in proceeding
A Study of the Impact of Bit-flip Errors on Programs Compiled with Different Optimization Levels
10th European Dependable Computing Conference, EDCC 2014; Newcastle upon Tyne; United Kingdom; 13 May 2014 through 16 May 2014 (2014), p. 146-157
Paper in proceeding
A Comparison of Inject-on-Read and Inject-on-Write in ISA-Level Fault Injection
11th European Dependable Computing Conference (2016), p. 178-189
Paper in proceeding
A Study of the Impact of Single Bit-Flip and Double Bit-Flip Errors on Program Execution
Computer Safety, Reliability, and Security. SAFECOMP, September 24-27 (2013)
Paper in proceeding
One Bit is (Not) Enough: An Empirical Study of the Impact of Single and Multiple Bit-Flip Errors
The 47th IEEE/IFIP International Conference on Dependable Systems and Networks (2017), p. 97-108
Paper in proceeding
Light-Weight Techniques for Improving the Controllability and Efficiency of ISA-Level Fault Injection Tools
Proceedings of IEEE Pacific Rim International Symposium on Dependable Computing, PRDC (2017), p. 68-77
Paper in proceeding
Modern computers are equipped with a range of hardware- and software-based mechanisms for detecting and correcting soft errors, as well as other types of hardware errors. While these mechanisms can handle a variety of errors and error types, protecting a computer completely from the effects of soft errors is technically and economically infeasible. Hence, in applications where reliability and data integrity are of primary concern, it is desirable to assess and measure the system's ability to detect and correct soft errors. Examples include applications in the automotive, avionics, and nuclear power industries, where failures could result in loss of life or damage to the environment.
This thesis is devoted to the problem of measuring error sensitivity of computer systems. We define error sensitivity as the probability that a soft error results in an erroneous system output. The complexity of computer systems makes it extremely demanding to assess the effectiveness of error handling mechanisms analytically. Therefore, error sensitivity is in practice determined experimentally by means of fault injection experiments. The basic approach of fault injection is to artificially insert errors into a system to enable an analysis of the system's behavior in the presence of errors.
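The experimental approach described above can be illustrated with a minimal sketch. It is not the tool or method used in the thesis: the toy workload (`workload`), the 32-bit word size, and the helper names `flip_bit`, `run_experiment`, and `estimate_error_sensitivity` are all assumptions made for illustration. Each experiment flips one randomly chosen bit in the program's data and checks whether the output differs from the fault-free output; the fraction of experiments with an erroneous output is an estimate of the error sensitivity.

```python
import random

WORD_BITS = 32  # assumed word size for this illustration


def workload(data):
    """Hypothetical program under test: returns the largest input value."""
    return max(data)


def flip_bit(value, bit):
    """Return value with the given bit position inverted (single bit-flip)."""
    return value ^ (1 << bit)


def run_experiment(data, rng):
    """One fault injection experiment.

    Flips one random bit in one random word of the data and reports
    whether the workload's output differs from the fault-free output.
    """
    golden = workload(data)              # fault-free reference output
    faulty = list(data)
    index = rng.randrange(len(faulty))   # injection location (which word)
    bit = rng.randrange(WORD_BITS)       # injection location (which bit)
    faulty[index] = flip_bit(faulty[index], bit)
    return workload(faulty) != golden


def estimate_error_sensitivity(data, experiments, seed=0):
    """Fraction of experiments in which the output was erroneous."""
    rng = random.Random(seed)
    failures = sum(run_experiment(data, rng) for _ in range(experiments))
    return failures / experiments


if __name__ == "__main__":
    print(estimate_error_sensitivity([3, 17, 250, 42], experiments=10_000))
```

Even in this toy setting the estimate depends on the input: a flip in a word that does not end up being the maximum is masked, so different inputs yield different estimates for the same program.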
The error sensitivity of a computer system depends not only on the design of its error handling mechanism, but also on the program executed by the computer. In addition, measurements of error sensitivity are affected by the experimental setup, including how and where the errors are injected. This thesis identifies and investigates six parameters that affect measurements of error sensitivity. These parameters fall into two subgroups: those that concern system characteristics, namely (i) the input processed by a program, (ii) the program's source code implementation, and (iii) the level of compiler optimization; and those that concern the measurement setup, namely (iv) the number of errors introduced into the system in each experiment, (v) the location in which errors are injected, and (vi) the time of injection.
To measure the error sensitivity of a system accurately, one needs to conduct several sets of fault injection experiments by varying different parameters. As these experiments are quite time-consuming, it is desirable to improve the efficiency of fault injection-based measurement by reducing the time and effort it requires. To this end, the thesis proposes and evaluates different techniques that reduce the number of experiments needed to measure the error sensitivity of computer systems.
Subject Categories
Computer Engineering
ISBN
978-91-7597-557-3
Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 4238
Publisher
Chalmers
Room EA, Rännvägen 6 (EDIT-building), Chalmers
Opponent: Professor Henrique Madeira, University of Coimbra, Portugal