On the Design and Validation of Fault Containment Regions in Distributed Communication Systems
Doctoral thesis, 2004

This thesis has a twofold focus. The first is an evaluation of an implementation of a time-triggered communication protocol, TTP-C1, which was stressed by means of heavy-ion fault injection. The results revealed a type of failure previously known only from theory: so-called slightly-out-of-specification faults, which manifested as Byzantine failures and led to system inconsistency. The second focus is the design and simulation of algorithms that allow RedCAN switches to dynamically reconfigure a CAN bus. The thesis is divided into three parts, which deal with the design, validation and analysis of dependable communication with respect to these two focuses.

Part I deals with the design of mechanisms for fault containment. We propose one algorithm to handle slightly-out-of-specification faults in the time domain, as they manifested during heavy-ion fault injection in the TTP-C1 implementation. We present a novel simulation tool for executing and testing, with two different algorithms, the scenarios that lead to serious communication degradation. A second algorithm that we propose provides distributed recovery after permanent bus and node failures in a CAN communication system using RedCAN switches.

Part II presents results from heavy-ion fault injection experiments in a TTP-C1 cluster consisting of four to nine synchronized nodes. The experiments were carried out to evaluate the performance of the time-triggered protocol and to assess the efficiency of the implemented dependability-increasing mechanism in the presence of faults. Part II furthermore presents performance results for different RedCAN recovery algorithms, collected with a novel simulation tool that we have designed, the RedCAN simulation manager. Part II additionally includes validation results for a proposed solution that isolates asymmetric faults in a time-triggered system by means of an active star coupler.

Part III analyzes the presented validation results and real-world experience; Byzantine faults in a distributed communication system are discussed in particular.
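
As an illustration of the slightly-out-of-specification faults in the time domain mentioned in the abstract, the sketch below shows how a frame that arrives marginally off its expected instant can be accepted by some receivers and rejected by others, which is the asymmetric (Byzantine) outcome described above. It is not taken from the thesis: the `Receiver` class, the `classify` function and all tolerance values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receiver:
    """A node with its own timing tolerance (hypothetical values)."""
    name: str
    window_us: float  # half-width of the acceptance window, in microseconds

    def accepts(self, arrival_offset_us: float) -> bool:
        # A frame is judged valid if it arrives within the node's window.
        return abs(arrival_offset_us) <= self.window_us


def classify(receivers, arrival_offset_us):
    """Return the cluster-wide outcome and each receiver's verdict."""
    verdicts = {r.name: r.accepts(arrival_offset_us) for r in receivers}
    if all(verdicts.values()):
        outcome = "consistent accept"
    elif not any(verdicts.values()):
        outcome = "consistent reject"
    else:
        # Receivers disagree about the same frame: an asymmetric fault.
        outcome = "inconsistent (asymmetric)"
    return outcome, verdicts


# Assumed component tolerances: nominally 10 us, but slightly different
# in each node, as real hardware would be.
RECEIVERS = [Receiver("A", 10.2), Receiver("B", 9.9), Receiver("C", 10.0)]

if __name__ == "__main__":
    for offset in (5.0, 10.1, 12.0):
        outcome, verdicts = classify(RECEIVERS, offset)
        print(f"offset {offset:5.1f} us: {outcome} {verdicts}")
```

Only the frame whose offset falls between the smallest and largest tolerance splits the receivers; clearly correct and clearly faulty frames are judged the same way by every node.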

fault handling and membership agreement

Byzantine faults

asymmetric faults

fault injection

time-triggered protocol

fault containment regions

slightly-out-of-specification faults

validation

Author

Håkan Sivencrona

Chalmers, Department of Computer Engineering

Subject categories

Computer and Information Science

ISBN

91-7291-378-9

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 2060

Technical report D - School of Computer Science and Engineering, Chalmers University of Technology: 23