Layered Fault Tolerance for Distributed Embedded Systems
This thesis addresses principles and techniques of fault tolerance for distributed embedded systems. It takes a layered approach to achieving high dependability, structuring error detection and recovery mechanisms into three layers. The first layer consists of mechanisms implemented in hardware, at either the circuit or the micro-architectural level. Many integrated circuits, especially microprocessors, include such mechanisms in order to mask transient hardware faults and to detect permanent ones. To prevent software faults, and hardware faults not captured at the hardware layer, from causing node failures, it is desirable to introduce node-layer mechanisms. While these may depend on hardware support such as memory protection, they are mostly implemented in software. For this second layer, the thesis proposes techniques for building robust operating systems that address software and hardware faults in a comprehensive manner. The goal is to guarantee the integrity of tasks in a multithreaded environment by preventing undesired interactions among tasks and by providing them with recovery services. Some of these techniques were added to an existing real-time kernel and assessed experimentally. To this end, an experimental platform with an associated fault injection tool was developed. Applied following a methodology for fault removal, the tool revealed two design flaws in the kernel extension. Even though the goal of node-layer mechanisms is to make computer nodes highly dependable, nodes may still fail. This motivates the development of system-layer mechanisms, the third layer, which deal with node failures. Accordingly, the thesis investigates methods for distributed redundancy management and proposes a protocol for guaranteeing consistent diagnosis of node failures in synchronous systems. Because the protocol is an important building block, it was formally verified using model checking.
An important goal of the proposed framework and the associated node-layer and system-layer mechanisms is to reduce the cost of fault tolerance in distributed embedded systems.