Analysis and Optimization of Communication Overheads in Multi-core Architectures
Licentiate thesis, 2013
The transition to multi-core architectures can be attributed mainly to fundamental limitations in clock frequency scaling, coupled with the slow growth in uniprocessor performance caused by the challenges of exploiting instruction-level parallelism. Consequently, programmers can no longer realize significant performance gains without investing effort in parallelizing applications. The shared memory paradigm offers a natural advantage in this context because it obviates the need to explicitly manage communication in applications and instead lets programmers focus on identifying and expressing parallelism. Communication, established by performing loads and stores to shared memory locations, is an inherent aspect of the shared memory model. With increasing core counts and the growing dominance of wire delay, the impact of establishing communication is bound to increase. The goals of this thesis are twofold: (1) to analyze the impact of communication overheads on the scalability of applications and the implications of such overheads for CMP design choices, and (2) to devise new approaches that reduce communication overheads and their impact on scalability in the light of modern task-based runtime systems.
The first study analyzes the impact of merging phases on the scalability of data-mining applications. The merging phase assembles partial results from multiple threads and has an inherently serial component that grows with the number of cores. The results establish that the scalability of such applications is much lower than predicted by Amdahl's law. They also show that such applications favor designs with a few large cores over designs with many small cores.

The second study proposes architectural support for data forwarding to mitigate the communication overheads associated with producer-consumer sharing. Existing forwarding approaches, which proactively push data from the producer to the consumer, are shown to have limited applicability for task-based applications. An alternative technique is proposed in which producers track the identities of the updated blocks and initiate forwarding only after receiving an initial request from the consumer. The technique leverages the spatial locality of producer-consumer data to simplify the hardware changes needed for tracking updates. The proposed forwarding scheme is shown to mitigate the communication overheads due to producer-consumer sharing.

Finally, the third study investigates the potential of using the inter-task dependency and mapping information available to the runtime system to facilitate coherence optimizations. It shows that by conveying this information to the underlying cache-coherence substrate, coherence optimizations can be triggered that in turn significantly reduce the communication overheads associated with prominent sharing patterns.
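The first study's observation about merging phases can be illustrated with a simple analytical model (a sketch with hypothetical parameters, not figures from the thesis): classic Amdahl's law assumes a fixed serial fraction, whereas a merge phase that assembles one partial result per core makes the effective serial fraction grow with the core count, so the predicted speedup eventually flattens and declines.

```python
def amdahl(serial_frac, cores):
    """Classic Amdahl's law: the serial fraction is independent of core count."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / cores)

def amdahl_growing_merge(serial_frac, merge_cost_per_core, cores):
    """Variant in which the serial fraction grows linearly with core count,
    modeling a merge phase that serially combines one partial result per core.
    Both parameters here are illustrative, not measured values."""
    s = serial_frac + merge_cost_per_core * cores
    return 1.0 / (s + (1.0 - s) / cores)

# With a growing merge phase, speedup at 64 cores falls below speedup
# at 16 cores, even though classic Amdahl's law keeps improving.
for p in (4, 16, 64):
    print(p, amdahl(0.01, p), amdahl_growing_merge(0.01, 0.001, p))
```

Under these assumed parameters the growing-merge model peaks well before 64 cores, matching the thesis's point that such applications scale worse than a naive Amdahl prediction and favor fewer, larger cores.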
Keywords: task parallelism, multi-core, cache coherence, runtime systems, sharing patterns, Amdahl's Law