Runtime Management of Multiprocessor Systems for Fault Tolerance, Energy Efficiency and Load Balancing

Stavros Tzilis

Doctoral thesis, 2019

The efficiency of modern multiprocessor systems is hurt by unpredictable events: aging causes permanent faults that disable components; applications spawning and terminating at arbitrary times affect energy proportionality, causing energy waste; load imbalances reduce resource utilization, penalizing performance. This thesis demonstrates how runtime management can mitigate the negative effects of such events by making decisions guided by a combination of static information known in advance and parameters that only become known at runtime. We propose techniques for three different objectives: graceful degradation of aging-prone systems; energy efficiency of heterogeneous adaptive systems; and load balancing by means of work stealing.
Managing aging-prone systems for graceful efficiency degradation is based on a high-level system description that encapsulates hardware reconfigurability and workload flexibility, making it possible to quantify system efficiency and use it as an objective function. Several custom heuristics, as well as simulated annealing and a genetic algorithm, are proposed to optimize this objective function in response to component failures. The custom heuristics are one to two orders of magnitude faster, provide better efficiency for the first 20% of system lifetime, and are less than 13% worse than the genetic algorithm at the end of this lifetime. They occasionally fail to satisfy reconfiguration cost constraints; since the execution time of all algorithms scales well with system size, the genetic algorithm can be used as a backup in these cases.
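The interplay between a fast heuristic and a slower backup search can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the toy objective (inverse load of the busiest surviving core), the cost proxy (number of migrated tasks), the function names, and the use of simulated annealing as the backup where the abstract names a genetic algorithm.

```python
import math
import random
from typing import Dict, List

Mapping = Dict[str, int]  # task name -> core id (hypothetical representation)


def efficiency(mapping: Mapping, load: Dict[str, float], alive: List[int]) -> float:
    """Toy objective (higher is better): inverse load of the busiest surviving core."""
    per_core = {c: 0.0 for c in alive}
    for task, core in mapping.items():
        per_core[core] += load[task]
    return 1.0 / max(per_core.values())


def migrations(old: Mapping, new: Mapping) -> int:
    """Reconfiguration cost proxy: number of tasks that changed core."""
    return sum(1 for t in old if old[t] != new[t])


def rebalance_heuristic(load: Dict[str, float], alive: List[int]) -> Mapping:
    """Fast heuristic: place tasks, heaviest first, on the least-loaded core.
    It ignores the old mapping, so it may exceed the migration budget."""
    per_core = {c: 0.0 for c in alive}
    new: Mapping = {}
    for task in sorted(load, key=lambda t: (-load[t], t)):
        target = min(alive, key=lambda c: (per_core[c], c))
        new[task] = target
        per_core[target] += load[task]
    return new


def anneal(old: Mapping, load: Dict[str, float], alive: List[int], failed: int,
           budget: int, steps: int = 2000, seed: int = 0) -> Mapping:
    """Backup search: simulated annealing restricted to budget-feasible mappings."""
    orphans = [t for t, c in old.items() if c == failed]
    if len(orphans) > budget:
        raise ValueError("budget too small to evacuate the failed core")
    rng = random.Random(seed)
    # Feasible start: move only the orphaned tasks, all to the first survivor.
    current = {t: (c if c != failed else alive[0]) for t, c in old.items()}
    best = dict(current)
    tasks = sorted(old)
    for step in range(steps):
        temp = 1.0 - step / steps  # linear cooling, stays strictly positive
        cand = dict(current)
        cand[rng.choice(tasks)] = rng.choice(alive)
        if migrations(old, cand) > budget:
            continue  # reject moves that violate the cost constraint
        delta = efficiency(cand, load, alive) - efficiency(current, load, alive)
        if delta >= 0 or rng.random() < math.exp(delta / (0.1 * temp)):
            current = cand
            if efficiency(current, load, alive) > efficiency(best, load, alive):
                best = dict(current)
    return best


def handle_failure(old: Mapping, load: Dict[str, float], cores: List[int],
                   failed: int, budget: int) -> Mapping:
    """Try the fast heuristic first; fall back to the search if it is too costly."""
    alive = [c for c in cores if c != failed]
    candidate = rebalance_heuristic(load, alive)
    if migrations(old, candidate) <= budget:
        return candidate
    return anneal(old, load, alive, failed, budget)
```

Calling `handle_failure(old, load, cores=[0, 1, 2], failed=2, budget=2)` returns the heuristic's mapping when it moves at most two tasks, and otherwise a mapping found by the constrained search. The design point the sketch tries to capture is that the backup only runs in the cases where the cheap heuristic violates the cost constraint.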
Managing heterogeneous multiprocessors capable of Dynamic Voltage and Frequency Scaling (DVFS) is based on a model that accurately predicts performance and power: performance is predicted by combining static, application-specific profiling information with dynamic, runtime performance monitoring data; power is predicted using these performance estimates and a set of platform-specific, static parameters that are determined only once and used for every application mix. Three runtime heuristics are proposed that use this model to perform a partial search of the configuration space, evaluating a small set of configurations and selecting the best one. When best-effort performance is adequate, the proposed approach achieves 3% higher energy efficiency than the powersave governor and 2x higher than the interactive and ondemand governors. When individual applications' performance requirements are considered, the proposed approach satisfies them, giving up 18% of the system's energy efficiency compared to the powersave governor, which, however, misses the performance targets by 23%; at the same time, the proposed approach maintains an efficiency advantage of about 55% over the other governors, which also satisfy the requirements.
Lastly, to improve load balancing of multiprocessors, a partial and approximate view of the current load distribution among system cores is proposed; it consists of lightweight data structures and is maintained by each core through cheap operations. A runtime algorithm uses this view whenever a core becomes idle to select a victim core for work stealing, also taking system topology and memory hierarchy into account. Among 12 diverse imbalanced workloads, the proposed approach achieves better performance than random, hierarchical and local stealing for six workloads. Furthermore, it is at most 8% slower on the other six workloads, while competing strategies incur a penalty of at least 89% on some workload.
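To make the idea of victim selection from an approximate load view more concrete, here is a minimal sketch. The data layout, the scoring rule (last known load divided by a distance cost) and all names such as `Core`, `view` and `remote_penalty` are hypothetical; the abstract does not specify the actual data structures or policy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Core:
    core_id: int
    node: int  # memory node / cluster the core belongs to (assumed topology model)
    queue: List[str] = field(default_factory=list)      # pending tasks
    view: Dict[int, int] = field(default_factory=dict)  # last known loads of other cores


def select_victim(thief: Core, cores: Dict[int, Core],
                  remote_penalty: float = 2.0) -> Optional[int]:
    """Pick the core with the highest known load, discounted by distance.
    Cores on another node are penalized because stealing from them is costlier."""
    best_id, best_score = None, 0.0
    for cid, known_load in sorted(thief.view.items()):
        if cid == thief.core_id or known_load <= 0:
            continue
        cost = 1.0 if cores[cid].node == thief.node else remote_penalty
        score = known_load / cost
        if score > best_score:
            best_id, best_score = cid, score
    return best_id


def steal(thief: Core, cores: Dict[int, Core]) -> int:
    """Steal half of the victim's queue; returns the number of tasks taken.
    The view may be stale, so the real queue can be shorter than expected."""
    victim_id = select_victim(thief, cores)
    if victim_id is None:
        return 0
    victim = cores[victim_id]
    split = len(victim.queue) - len(victim.queue) // 2
    stolen = victim.queue[split:]
    del victim.queue[split:]
    thief.queue.extend(stolen)
    thief.view[victim_id] = len(victim.queue)  # refresh the view with what was observed
    return len(stolen)
```

With an idle core on node 0 whose view reports loads of 4 (same node), 6 and 10 (remote node), the scores are 4, 3 and 5, so the heavily loaded remote core is chosen; if only the first two were known, the local core would win despite its lower load. The view is refreshed only as a side effect of stealing, which is one cheap way to keep it approximately up to date.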

Fault Tolerance

Algorithms

Runtime Management

Multiprocessors

Performance

Adaptive Systems

Load Balancing

Energy Efficiency

Room EA, Rännvägen 4, Chalmers

Opponent: Professor Luigi Carro, Universidade Federal do Rio Grande do Sul, Brazil

Author

Stavros Tzilis

Chalmers, Computer Science and Engineering (Chalmers), Computer Engineering (Chalmers)

A runtime manager for gracefully degrading SoCs

Proceedings - IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (2014), p. 216-221

Paper in proceeding

A Probabilistic Analysis of Resilient Reconfigurable Designs

Proceedings - IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (2014), p. 141-146

Paper in proceeding

A dependable coarse-grain reconfigurable multicore array

Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS (2014), p. 141-150

Paper in proceeding

The DeSyRe runtime support for fault-tolerant embedded MPSoCs

Proceedings - 2014 IEEE International Symposium on Parallel and Distributed Processing with Applications, ISPA 2014 (2014), p. 197-204

Paper in proceeding

Reducing the performance overhead of resilient CMPs with substitutable resources

Proceedings of the 2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFTS 2015 (2015), p. 191-196

Paper in proceeding

Resilient chip multiprocessors with mixed-grained reconfigurability

IEEE Micro, Vol. 36 (2016), p. 35-45

Journal article

Runtime Management of Adaptive MPSoCs for Graceful Degradation

2016 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES) (2016), Article number 2968517

Paper in proceeding

Energy-efficient Runtime Management of Heterogeneous Multicores using Online Projection

Transactions on Architecture and Code Optimization, Vol. 15 (2019)

Journal article

SWAS: Stealing Work Using Approximate System-Load Information

Proceedings of the International Conference on Parallel Processing Workshops (2017), p. 309-318

Paper in proceeding

Computer systems often have to operate under variable conditions that cannot be predicted in advance. Unpredictable changes to these operating conditions can critically reduce the efficiency of a computer system. To mitigate this effect, this thesis proposes ways to react to such changes so that the efficiency loss is minimized. For example, consider a computer system that consists of multiple computational units (i.e., processors). If one of these processors stops working due to aging of its transistors, the system stops producing results. To avoid this, we design a unit called a "runtime manager". The runtime manager is informed of the processor failure and, at that time, decides on a set of steps that allow the system to adapt to the new situation and recover. For instance, the computational duties of the broken processor can be reassigned to other processors that are still working. Alternatively, the working processors can be connected differently than before, so that they can cover for the failed processor.

The thesis describes strategies to achieve the runtime management described above for three different types of systems. The first is a system like the one described in the previous paragraph, whose processors can fail at unpredictable times. The objective in this case is to maintain acceptable system operation for as long as possible, despite the failures. The second is a system that executes an unpredictable combination of applications, such as a handheld portable device. The objective in this case is to adapt to the user starting and terminating various applications and to maintain proper function while using as little of the device's battery as possible. The last is a system consisting of many processors, running an application that does not always make use of all of them and therefore wastes computational power. The objective in this last case is to redistribute the various parts of the application across all available processors, allowing them to share the workload in a more balanced manner so that the application runs faster.
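For the second type of system, the idea of picking an energy-efficient operating point from predictions can be sketched as follows. This is a toy illustration under stated assumptions, not the model or heuristics of the thesis: the frequency levels, the platform parameters in `PLATFORM`, the linear performance projection with a profiled `sensitivity` factor, and the first-order power model (dynamic power growing with the cube of frequency, assuming voltage scales with frequency) are all hypothetical.

```python
from typing import Dict, List, Tuple

FREQS_GHZ: List[float] = [0.6, 1.0, 1.4, 1.8]  # hypothetical DVFS levels

# Hypothetical static platform parameters per cluster type:
# (static power in W, dynamic coefficient in W/GHz^3), measured once per platform.
PLATFORM: Dict[str, Tuple[float, float]] = {"big": (0.5, 0.60), "little": (0.1, 0.15)}


def predict_perf(measured_ips: float, measured_freq: float, new_freq: float,
                 sensitivity: float) -> float:
    """Project monitored performance to another frequency.
    sensitivity comes from offline profiling: 1.0 means performance scales
    fully with frequency (compute-bound), 0.0 means not at all (memory-bound)."""
    return measured_ips * (1.0 + sensitivity * (new_freq / measured_freq - 1.0))


def predict_power(cluster: str, freq: float) -> float:
    """First-order power model: static part plus a cubic dynamic part."""
    static, coeff = PLATFORM[cluster]
    return static + coeff * freq ** 3


def pick_level(cluster: str, level: int, measured_ips: float, sensitivity: float,
               required_ips: float = 0.0) -> int:
    """Partial search: evaluate only the current level and its neighbours.
    Returns the most energy-efficient level that meets the requirement, or the
    fastest evaluated level if none does."""
    scored = []
    for i in (level - 1, level, level + 1):
        if 0 <= i < len(FREQS_GHZ):
            perf = predict_perf(measured_ips, FREQS_GHZ[level], FREQS_GHZ[i], sensitivity)
            scored.append((i, perf, perf / predict_power(cluster, FREQS_GHZ[i])))
    feasible = [s for s in scored if s[1] >= required_ips]
    if feasible:
        return max(feasible, key=lambda s: s[2])[0]
    return max(scored, key=lambda s: s[1])[0]
```

With no performance requirement, the sketch tends to step down in frequency because the cubic power term dominates; once a requirement is given, it settles on the lowest-power level whose predicted performance still meets it. Evaluating only neighbouring levels keeps each decision cheap, at the price of needing several invocations to reach a distant operating point.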

These runtime managers have been evaluated through simulations as well as experiments on real systems that conform to the described specifications. They have been shown to improve system efficiency compared to prior work.

Embedded Multi-Core Systems for Mixed Criticality Applications in Dynamic and Changeable Real-Time Environments (EMC2)

European Commission (EC) (EC/FP7/621429), 2014-04-01 -- 2017-03-31.

VINNOVA (2014-00607), 2014-04-01 -- 2017-03-31.

Energy-efficient Heterogeneous COmputing at exaSCALE (ECOSCALE)

European Commission (EC) (EC/H2020/671632), 2015-10-01 -- 2018-12-31.

Meeting Challenges in Computer Architecture (MECCA)

European Commission (EC) (EC/FP7/340328), 2014-02-01 -- 2019-01-31.

on-Demand System Reliability (DeSyRe)

European Commission (EC) (EC/FP7/287611), 2011-10-01 -- 2015-01-31.

Subject Categories (SSIF 2011)

Computer Engineering

Embedded Systems

Computer Systems

Areas of Advance

Information and Communication Technology

Energy

ISBN

978-91-7597-878-9

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 4559

Publisher

Chalmers

More information

Latest update

2/28/2019
