Advances on Adaptive Fault-Tolerant System Components: Micro-processors, NoCs, and DRAM
The adverse effects of technology scaling on the reliability of digital circuits have made fault-tolerance techniques increasingly necessary in modern computing systems. Digital designers continuously search for efficient techniques that improve reliability while keeping the imposed overheads low.
However, unpredictable changes in system conditions, e.g. available resources, working environment or reliability requirements, can significantly affect the efficiency of a fault-handling mechanism. In light of this problem, adaptive fault tolerance (AFT) techniques have emerged as a flexible and more efficient way to maintain the reliability level by adjusting to new system conditions. Aside from this primary application of AFT techniques, this thesis suggests that adding adaptability to hardware components provides the means for a better trade-off between achieved reliability and incurred overheads. On this account, hardware adaptability is explored in three main components of a multi-core system, namely micro-processors, Networks-on-Chip (NoCs) and main memories.
In the first part of this thesis, a reliable micro-processor array architecture that can adapt to permanent faults is studied. The architecture supports a mix of coarse- and/or fine-grain reconfiguration. To this end, the micro-processor is divided into smaller substitutable units (SUs), which are connected to each other using reconfigurable interconnects. Then, a design-space exploration of such an adaptive micro-processor array is presented to find the best trade-off between reliability and its overheads, considering different granularities of SUs and reconfiguration options. Briefly, the results reveal that the combination of fine- and coarse-grain reconfiguration offers up to 3× more fault tolerance with the same overhead compared to simple processor-level redundancy.
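The intuition behind substitutable units can be illustrated with a minimal sketch. It is not the architecture or the evaluation method of the thesis: the array size, the number of SU types, the fault model (uniformly random permanent faults) and all function names are assumptions made for illustration, and the sketch ignores interconnect faults and reconfiguration overheads. It compares processor-level redundancy, where a single faulty SU disables the whole processor, with an idealized fine-grain scheme in which any fault-free SU can replace a faulty SU of the same type in another processor.

```python
import random

def coarse_ok(faulty, n_procs, n_types, needed):
    """Processor-level redundancy: a processor is usable only if
    none of its SUs is faulty."""
    healthy = sum(
        all((p, t) not in faulty for t in range(n_types))
        for p in range(n_procs)
    )
    return healthy >= needed

def fine_ok(faulty, n_procs, n_types, needed):
    """Idealized fine-grain reconfiguration: SUs of the same type are
    interchangeable, so each type needs `needed` fault-free instances."""
    return all(
        sum((p, t) not in faulty for p in range(n_procs)) >= needed
        for t in range(n_types)
    )

def survival_rates(n_faults, n_procs=4, n_types=5, needed=2,
                   trials=2000, seed=0):
    """Monte Carlo estimate of the fraction of random fault patterns
    that each scheme survives (all parameters are hypothetical)."""
    rng = random.Random(seed)
    units = [(p, t) for p in range(n_procs) for t in range(n_types)]
    coarse = fine = 0
    for _ in range(trials):
        faulty = set(rng.sample(units, n_faults))
        coarse += coarse_ok(faulty, n_procs, n_types, needed)
        fine += fine_ok(faulty, n_procs, n_types, needed)
    return coarse / trials, fine / trials

if __name__ == "__main__":
    for f in (2, 4, 6):
        print(f, survival_rates(f))
```

Under these assumptions, every fault pattern that the coarse-grain scheme survives is also survived by the fine-grain scheme, because a fully fault-free processor contributes one fault-free SU of every type; the reverse does not hold.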
The second part of this thesis presents RQNoC, a service-oriented NoC that can adapt to permanent faults. Network resources are characterized by the particular service they support and, when faulty, can be bypassed through two redirection options: service merging (SMerge) and/or service detouring (SDetour). While SDetour keeps the lanes of different services isolated at the cost of longer paths, SMerge trades service isolation for shorter paths and higher connectivity. Different RQNoC configurations are implemented and evaluated in terms of network performance, implementation results and reliability. Concisely, the evaluation results show that, compared to the baseline network, SMerge maintains at least 90% of the network connectivity even in the presence of 32 permanent network faults, which is more than double that of SDetour, but it imposes 51% more area, 27% more power and a 9% slower clock.
Finally, the last part of this thesis presents a fault-tolerant scheme for DRAM that enables a trade-off between DRAM capacity and fault tolerance. We introduce Odd-ECC DRAM mapping, a novel mechanism that dynamically selects Error-Correcting Codes (ECCs) of different strengths and overheads for each allocated page of a program in main memory. Odd-ECC is applied to memory systems that use conventional 2D as well as 3D-stacked DRAMs and is evaluated using various applications. Our experiments show that, compared to flat memory protection schemes, Odd-ECC reduces the ECC capacity overhead by up to 39% while achieving the same Mean Time to Failure (MTTF).
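The capacity argument behind per-page ECC selection can be made concrete with a small accounting sketch. It is not the Odd-ECC mapping itself and does not model MTTF: the scheme names, the check-bit counts per 64 data bits and the page mix are hypothetical values chosen for illustration (8 check bits per 64 data bits corresponds to the familiar SECDED ratio of 12.5%). A flat scheme is assumed to protect every page with the strongest code that any page requires.

```python
# Hypothetical ECC schemes: check bits stored per 64 data bits.
ECC_CHECK_BITS = {"none": 0, "secded": 8, "strong": 16}

def per_page_overhead(page_schemes):
    """ECC capacity overhead (relative to data capacity) when each
    equally sized page uses its own scheme."""
    total_check_bits = sum(ECC_CHECK_BITS[s] for s in page_schemes)
    return total_check_bits / (64 * len(page_schemes))

def flat_overhead(page_schemes):
    """Overhead when every page uses the strongest scheme that any
    page in the mix requires."""
    return max(ECC_CHECK_BITS[s] for s in page_schemes) / 64

if __name__ == "__main__":
    # Illustrative page mix, not measured data.
    pages = ["strong"] * 2 + ["secded"] * 4 + ["none"] * 2
    print(per_page_overhead(pages))  # 0.125
    print(flat_overhead(pages))      # 0.25
```

In this made-up mix the per-page assignment halves the ECC capacity overhead; the actual saving depends on how many pages need strong protection to meet a given reliability target.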
Adaptive fault tolerance