On the Design and Validation of Fault Containment Regions in Distributed Communication Systems
Doctoral thesis, 2004
This thesis has a two fold focus where the first is an evaluation of a time-triggered communication protocol implementation, TTP-C1, which was stressed by use of heavy-ion fault injection. The gathered result showed a novel type of failures, earlier known only from theory, so-called slightly-out-of-specification faults, that manifested as Byzantine failures with resulting system inconsistence. The second focus lies on the design and simulation of algorithms for RedCAN switches to dynamically reconfigure a CAN bus.
The thesis is divided into three parts which deals with the design, validation and analysis of dependable communication with respect to the above mentioned focus.
Part I deals with the design of mechanisms for fault containment. We propose one algorithm to handle slightly-out-of-specification faults in time domain as they manifested during heavy-ion fault injections in TTP-C1 implementation. We present a novel simulation tool to test and execute the scenarios that lead to serious communication degradation using two different algorithms.
A second algorithm that we propose handles a distributed recovery approach after permanent bus and node failures in a CAN communication system using RedCAN switches.
Part II presents results from heavy-ion fault injection experiments in a TTP-C1 cluster consisting of four to nine synchronized nodes. This was done to be able to evaluate the performance of the time-triggered protocol and assess the efficiency of the implemented dependability increasing mechanism in presence of faults.
Part II furthermore presents performance results of different RedCAN recovery algorithms collected through a novel simulation tool, RedCAN simulation manager that we have designed. Part II additionally includes validation result of a proposed solution for isolating asymmetric faults in a time-triggered system through an active star coupler.
Part III contains analyses of the results given from presented validation results and real world experiences, especially Byzantine faults in a distributed communication system are discussed.
fault handling and membership agreement
fault containment regions