Multiple-Bit Errors in Computer Systems
Doctoral thesis, 1997
This thesis discusses the types of transient faults caused by heavy-ion or α-particle radiation that manifest as multiple-bit errors and how information redundancy techniques can achieve fault tolerance with respect to some classes of multiple-bit errors.
The first part of the thesis considers the fault modelling problem, beginning with the problem of how single event upsets (SEUs) originating from faults in register cells in microprocessors manifest as primary errors. This is analysed and discussed on the basis of Californium-252 (Cf-252) fault injection experiments. The conclusion is that a relatively large fraction of the SEUs cause more than one bit to flip. For one of the microprocessors, the type of double error that occurs is explained by a model based on a study of the register file layout.
The thesis then presents an experimental method for investigating the impact of particle-induced transients in combinational logic in CMOS circuits. This new method makes it possible to estimate the probability that such transients propagate into memory elements and to predict whether they will manifest as single-bit errors or as multiple-bit errors. The probability parameters are determined by i) physical fault injection experiments using heavy-ion radiation from Cf-252, ii) switch-level and circuit-level simulations, and iii) general knowledge of the circuit investigated.
The thesis also presents a validation of fault models for the propagation of transients in CMOS circuits. Simulated fault injection and physical fault injection were performed to validate the charge collection model, expressed as a double exponential current pulse, against data obtained from the heavy-ion radiation experiments. The results show that this charge collection model does not produce error behaviour in agreement with that observed in physical fault injection experiments. For SEUs originating from combinational logic, the analysis raises two questions: whether sensitivity analyses based on this model can be trusted, and whether all transient fault simulations must be made on the device level in order to be accurate.
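For orientation, the double exponential current pulse is commonly written as I(t) = Q / (tau_fall - tau_rise) * (exp(-t / tau_fall) - exp(-t / tau_rise)), which integrates to the collected charge Q. The following is a minimal sketch of that general form only; the function names and the numerical values (100 fC, 200 ps, 50 ps) are illustrative assumptions and are not taken from the thesis.

```python
import math


def double_exponential_current(t, charge, tau_fall, tau_rise):
    """Current (A) at time t (s) for a pulse that collects `charge` coulombs.

    tau_fall is the slower (collection) time constant and tau_rise the
    faster (track establishment) time constant; tau_fall > tau_rise.
    """
    if t < 0:
        return 0.0
    scale = charge / (tau_fall - tau_rise)
    return scale * (math.exp(-t / tau_fall) - math.exp(-t / tau_rise))


def collected_charge(charge, tau_fall, tau_rise, t_end, steps=100_000):
    """Integrate the pulse from 0 to t_end with the trapezoidal rule."""
    dt = t_end / steps
    total = 0.0
    for i in range(steps):
        i0 = double_exponential_current(i * dt, charge, tau_fall, tau_rise)
        i1 = double_exponential_current((i + 1) * dt, charge, tau_fall, tau_rise)
        total += 0.5 * (i0 + i1) * dt
    return total


# Assumed example values, chosen only for illustration.
Q = 100e-15          # 100 fC
TAU_FALL = 200e-12   # 200 ps
TAU_RISE = 50e-12    # 50 ps

if __name__ == "__main__":
    q = collected_charge(Q, TAU_FALL, TAU_RISE, t_end=5e-9)
    print(f"integrated charge: {q:.3e} C")
```

Integrating the pulse over a window much longer than tau_fall recovers Q, which is a quick sanity check before a waveform of this kind is injected into a circuit-level simulation.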
The second part of the thesis considers error-correcting codes that detect multiple-bit errors while still correcting single-bit errors. The contribution to the research area is in memory systems with a b-bit-per-chip organization, where b may be four or eight (or even larger).
The first coding problem investigated in the thesis concerns a memory application with eight data bits and four check bits, and a memory with a four-bit-per-chip organization. The task is to find the best possible error detection capability for such a system, given that single-bit error correction is required. The solution presented gives all codes that correct all single-bit errors and detect all double-bit errors confined to a memory chip.
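The two properties named above can be checked mechanically from the columns of a parity-check matrix. The sketch below assumes a 12-bit word spread over three four-bit chips and uses one hypothetical column assignment (the integers 4 to 15, read as 4-bit syndromes); it illustrates the check itself and is not one of the codes derived in the thesis.

```python
from itertools import combinations

CHIP_BITS = 4

# Hypothetical parity-check matrix: COLUMNS[j] is the 4-bit syndrome
# produced by a single-bit error in bit j of the 12-bit word.
COLUMNS = list(range(4, 16))

# Bits 0-3 sit on chip 0, bits 4-7 on chip 1, bits 8-11 on chip 2.
CHIPS = [COLUMNS[i:i + CHIP_BITS] for i in range(0, len(COLUMNS), CHIP_BITS)]


def corrects_single_errors(columns):
    """Single-bit errors are correctable iff all columns are distinct and non-zero."""
    return 0 not in columns and len(set(columns)) == len(columns)


def detects_double_errors_in_chip(columns, chips):
    """A double-bit error within one chip is detected iff its syndrome
    (the XOR of the two columns) is non-zero and is not itself a column,
    so it cannot be mistaken for a correctable single-bit error."""
    single_error_syndromes = set(columns)
    for chip in chips:
        for a, b in combinations(chip, 2):
            syndrome = a ^ b
            if syndrome == 0 or syndrome in single_error_syndromes:
                return False
    return True


if __name__ == "__main__":
    print(corrects_single_errors(COLUMNS))
    print(detects_double_errors_in_chip(COLUMNS, CHIPS))
```

In this assumed assignment each chip holds the four syndromes that share the same two high-order bits, so the XOR of any two columns on a chip has zero high-order bits and falls among the three unused values 1, 2 and 3. A naive assignment such as columns 1 to 12 in order fails the second check, because 1 XOR 2 = 3 is itself a column.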
The thesis then presents two error detection and correction (EDAC) circuits designed and manufactured for the European space program. The EDACs are to be used in a memory application with eight check bits and a memory with a four-bit-per-chip or an eight-bit-per-chip organization. The two codes are true SEC-DED-SBD (single-error-correcting, double-error-detecting, single-byte-error-detecting) codes for a four-bit-per-chip memory organization. At the same time, they are SEC-DED-SBD for an eight-bit-per-chip memory organization, given that the chip errors (but not necessarily the bit errors) are permanent.
Last, the thesis presents the design of a class of error-correcting codes where only one b-bit byte of check bits is added to the data bits. The problem studied is to find the best possible error detection capability for such a system, given that single-bit error correction is required. The codes have the capability of correcting single-bit errors and detecting byte errors in which the erroneous byte has all 0s or all 1s. They also detect gross errors in which all bits in a word have become all 0s or all 1s.
fault simulation
single event upsets
physical fault injection