Multi-Layer Fault Tolerance for Distributed Real-Time Systems
Licentiatavhandling, 2007
This thesis addresses issues in building fault-tolerant distributed real-time systems. Such systems are increasingly deployed in automotive and avionics applications. We focus on the design and validation of fault tolerance mechanisms.
From the design viewpoint, we develop the notion of multi-layer fault tolerance. A fault-tolerant distributed system contains a set of mechanisms that provide error detection and recovery. Those mechanisms can be structured into three different layers, based on where they are implemented and what parts of the system they involve. Circuit layer mechanisms provide the basic fault tolerance implemented in hardware; node layer mechanisms are executed locally in computer nodes; and system layer techniques involve multiple computer nodes to prevent faults from disturbing the system.
We make a probabilistic modeling analysis to compare federated to integrated architectures. Federated architectures have few or no fault tolerance mechanisms at the node layer and a node is the elementary unit of failure; integrated architectures provide robust partitioning mechanisms at the node layer in order to ensure that individual tasks are the unit of failure. We compare the reliability of the two architectures and propose a set of guidelines for building integrated architectures.
The thesis also addresses the problem of distributed redundancy management. We propose a group membership protocol to achieve consensus on the operational state of all nodes. The protocol is based on the principle that each message sent by a node in the membership is acknowledged by k other nodes, in a system with n nodes. Agreement on node departure is guaranteed if no more than f=k-1 failures occur during n consecutive transmission slots. Additionally, we provide a solution for the reintegration of restarted nodes in the membership. This protocol is part of the system layer of fault tolerance mechanisms.
We address the validation of fault tolerance mechanisms by fault injection. This thesis describes an automated analysis technique to reduce the cost of fault injection campaigns. The analysis uses knowledge of program flow and resource usage to eliminate faults that have no possibility of activation. Our experimental results show that the fault-spaces are reduced by several orders of magnitude, when compared with the usual random approach.