Node-Level Fault Tolerance for Embedded Real-Time Systems
Doctoral thesis, 2004

This thesis deals with cost-effective design and validation of fault-tolerant distributed real-time systems. Such systems play an increasingly important role in embedded applications such as automotive and aerospace systems. The cost of fault tolerance is of primary concern in these systems, particularly for emerging applications like micro-satellites, unmanned air vehicles and active safety systems for road vehicles. We address cost issues of fault tolerance from both a design and a validation perspective.

From a design perspective, we investigate cost-effective techniques that can make systems more resilient to transient hardware faults. We propose a two-level approach that combines system-level and node-level fault tolerance. Our approach relies on nodes that mask the effects of most transient faults and exhibit omission or fail-silent failures for permanent faults and for transient faults that cannot be masked by the node itself. As only a subset of the faults is tolerated at the node level, we call this approach light-weight node-level fault tolerance, or light-weight NLFT. Tolerating transient faults at the node level is important in systems that rely on duplicated nodes for fault tolerance, as it allows the system to survive transient faults even when one of the nodes has failed permanently. It also improves the robustness of the system when both nodes are affected by correlated or near-coincident transient faults. We have implemented a real-time kernel that supports light-weight NLFT through time-redundant execution of tasks and the use of software-implemented error detection. The effectiveness of light-weight NLFT is evaluated both analytically and by extensive fault injection experiments.

The thesis also deals with the cost of fault tolerance from a validation perspective. Fault-injection-based validation of error handling mechanisms is a time-consuming and costly activity. We present a fault injection tool that can be easily extended and adapted to different target systems, making fault injection less time-consuming. We also propose an analytical technique for investigating how error coverage varies for different input sequences to a system. The analysis helps us identify interesting activation patterns, e.g., those that give extremely low or extremely high error coverage.
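As an illustration of the node-level behaviour described in the abstract, time-redundant execution with error masking can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the kernel implemented in the thesis: the function name `execute_with_masking`, the `max_runs` parameter, the use of Python exceptions to stand in for software-implemented error detection, and the use of `None` to represent an omission failure are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def execute_with_masking(task: Callable[[], T], max_runs: int = 3) -> Optional[T]:
    """Run `task` repeatedly and return the first result produced twice.

    Assumptions (illustrative only):
    - an exception raised by `task` models an error caught by an
      error detection mechanism; that run is discarded;
    - `None` models an omission (fail-silent) failure, so `task`
      itself is assumed never to return `None`.
    """
    results: List[T] = []
    for _ in range(max_runs):
        try:
            current = task()
        except Exception:
            # Detected error: discard this execution and try again
            # if the run budget allows it.
            continue
        # Accept the result as soon as it matches an earlier execution.
        for earlier in results:
            if earlier == current:
                return current
        results.append(current)
    # No two executions agreed within the budget: stay silent.
    return None
```

With the default budget of three runs, a fault-free task finishes after two matching executions, a single corrupted or aborted execution is masked by the remaining two, and any pattern without two agreeing results leads to an omission that the system level (for example a duplicated node) is expected to handle.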

fault injection

error recovery

reliability calculations

time redundancy

real-time kernels

error detection

analytical coverage estimation

distributed real-time systems

Author

Joakim Aidemark

Chalmers, Department of Computer Engineering

Subject categories

Computer and Information Science

ISBN

91-7291-530-7

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 2212

Technical report D - School of Computer Science and Engineering, Chalmers University of Technology: 34