Node-Level Fault Tolerance for Embedded Real-Time Systems
This thesis deals with cost-effective design and validation of fault tolerant distributed real-time systems. Such systems play an increasingly important role in embedded applications such as automotive and aerospace systems. The cost of fault-tolerance is of primary concern in these systems, particularly for emerging applications like micro-satellites, unmanned air vehicles and active safety systems for road vehicles.
We address cost issues of fault tolerance from both a design and a validation perspective. From a design perspective, we investigate cost-effective techniques that can make systems more resilient to transient hardware faults. We propose a two-level approach to achieve fault-tolerance that combines system-level and node-level fault tolerance. Our approach relies on nodes that mask the effects of most transient faults and exhibit omission or fail-silent failures for permanent faults and transient faults that cannot be masked by the node itself. As only a subset of the faults is tolerated at the node level, we call this approach /light-weight node-level fault tolerance/, or light-weight NLFT. Tolerating transient faults at the node level is important in systems that rely on duplicated nodes for fault tolerance, as it allows the system to survive transient faults also when one of the nodes have failed permanently. It also improves the robustness of the system when both nodes are affected by correlated or near coincident transient faults. We have implemented a real-time kernel that supports light-weight NLFT through time redundant execution of tasks and the use of software implemented error detection. The effectiveness of light-weight NLFT is evaluated both analytically and by extensive fault injection experiments.
The thesis also deals with the cost of fault tolerance from a validation perspective. Fault injection based validation of error handling mechanisms is a time consuming and costly activity. We present a fault injection tool that can be easily extended and adapted to different target systems making fault injection less time-consuming. We also propose an analytical technique for investigating how error coverage varies for different input sequences to a system. The analysis helps us identify interesting activation patterns, e.g., those that give extremely low, or high, error coverage.
analytical coverage estimation
distributed real-time systems