Multi-Layer Fault Tolerance for Distributed Real-Time Systems
Licentiate thesis, 2007

This thesis addresses the design and validation of fault tolerance mechanisms for distributed real-time systems, which are increasingly deployed in automotive and avionics applications. From the design viewpoint, we develop the notion of multi-layer fault tolerance. A fault-tolerant distributed system contains a set of mechanisms that provide error detection and recovery. These mechanisms can be structured into three layers, according to where they are implemented and which parts of the system they involve: circuit layer mechanisms provide the basic fault tolerance implemented in hardware; node layer mechanisms execute locally in computer nodes; and system layer mechanisms involve multiple computer nodes to prevent faults from disturbing the system.
We use probabilistic modeling to compare federated and integrated architectures. Federated architectures have few or no fault tolerance mechanisms at the node layer, so a node is the elementary unit of failure; integrated architectures provide robust partitioning mechanisms at the node layer so that individual tasks are the unit of failure. We compare the reliability of the two architectures and propose a set of guidelines for building integrated architectures.
The thesis also addresses the problem of distributed redundancy management. We propose a group membership protocol to achieve consensus on the operational state of all nodes. The protocol is based on the principle that, in a system with n nodes, each message sent by a node in the membership is acknowledged by k other nodes. Agreement on node departure is guaranteed if no more than f = k-1 failures occur during n consecutive transmission slots. Additionally, we provide a solution for reintegrating restarted nodes into the membership. This protocol belongs to the system layer of fault tolerance mechanisms.
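The fault assumption behind the membership guarantee (no more than f = k-1 failures during n consecutive transmission slots) can be made concrete with a small check. The following is only an illustrative sketch, not the protocol itself: the function name `assumption_holds` and the representation of failures as a list of slot indices are assumptions introduced here.

```python
def assumption_holds(failure_slots, n, k):
    """Check the fault hypothesis: at most f = k - 1 failures
    in any window of n consecutive transmission slots.

    failure_slots: slot indices at which a failure occurred
                   (hypothetical representation, one entry per failure).
    n: number of nodes, taken here as the window length in slots.
    k: number of nodes that acknowledge each message.
    """
    f = k - 1
    slots = sorted(failure_slots)
    # If f + 1 failures fit inside a span shorter than n slots,
    # some window of n consecutive slots holds more than f failures.
    for i in range(len(slots) - f):
        if slots[i + f] - slots[i] < n:
            return False
    return True


# With n = 4 and k = 3, up to f = 2 failures per window are allowed.
ok = assumption_holds([1, 2, 9], n=4, k=3)
too_dense = assumption_holds([1, 2, 4], n=4, k=3)
```

Here `ok` is true because the three failures never fall within four consecutive slots, while `too_dense` is false because slots 1 to 4 contain three failures. Under the stated principle, a larger k tolerates denser failures at the price of more acknowledgements per message.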
We address the validation of fault tolerance mechanisms by fault injection. This thesis describes an automated analysis technique that reduces the cost of fault injection campaigns. The analysis uses knowledge of program flow and resource usage to eliminate faults that have no possibility of activation. Our experimental results show that the fault space is reduced by several orders of magnitude compared with the usual random approach.
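One way to picture such a pre-injection analysis is a backward pass over an execution trace that keeps only the (time, register) points where a corrupted value would be read before being overwritten. This is a minimal sketch under assumptions made here, not the technique as implemented in the thesis: the trace format (a list of per-instruction read and write sets) and the name `live_fault_points` are hypothetical.

```python
def live_fault_points(trace):
    """Return the (time, register) pairs at which a fault injected just
    before instruction `time` can be activated, i.e. the register is
    read at or after `time` before it is next written.

    trace: list of (reads, writes) pairs, one per executed instruction
           (hypothetical format assumed for this sketch).
    """
    live = set()
    read_before_write = set()  # registers whose current value is still needed
    # Walk the trace backwards, as in classic liveness analysis:
    # needed_before = reads | (needed_after - writes)
    for t in range(len(trace) - 1, -1, -1):
        reads, writes = trace[t]
        read_before_write = set(reads) | (read_before_write - set(writes))
        for reg in read_before_write:
            live.add((t, reg))
    return live


# Three instructions over registers r1, r2, r3.
trace = [
    (["r1"], ["r2"]),        # r2 = f(r1)
    ([], ["r1"]),            # r1 = constant
    (["r1", "r2"], ["r3"]),  # r3 = g(r1, r2)
]
points = live_fault_points(trace)
full_space = len(trace) * 3  # every register at every time step
```

In this toy trace only 4 of the 9 candidate points remain, since a fault in r1 just before the second instruction, for example, is overwritten before any read. Random sampling over the full space would spend most injections on faults of that kind.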

Room EA, Hörsalsvägen 11, Chalmers University of Technology

Author

Raul Barbosa

Chalmers, Computer Science and Engineering, Computer Engineering

Assembly-Level Pre-injection Analysis for Improving Fault Injection Efficiency

Proceedings of the 5th European Dependable Computing Conference (EDCC-5), Budapest, 2005, p. 246-262

Journal article

Flexible, Cost-Effective Membership Agreement in Synchronous Systems

12th Pacific Rim International Symposium on Dependable Computing, 18-20 December 2006, Riverside, California, USA, p. 105-112

Paper in proceedings

Subject categories

Computer Engineering

Technical report L - Department of Computer Science and Engineering, Chalmers University of Technology and Göteborg University: 39L
