Analysis and Optimization of Communication Overheads in Multi-core Architectures
Licentiate thesis, 2013

The transition to multi-core architectures can be attributed mainly to fundamental limitations in clock frequency scaling, coupled with slow growth in uniprocessor performance caused by the challenges of exploiting instruction-level parallelism. Consequently, programmers can no longer realize significant performance gains without investing effort in parallelizing applications. The shared memory paradigm offers a natural advantage in this context because it obviates the need to explicitly manage communication in applications and instead lets programmers focus on identifying and expressing parallelism. Communication, established by performing loads and stores to shared memory locations, is an inherent aspect of the shared memory model. With increasing core counts and the growing dominance of wire delay, the impact of establishing communication is bound to increase. The goals of this thesis are twofold: (1) to analyze the impact of communication overheads on the scalability of applications and the implications of such overheads for chip multiprocessor (CMP) design choices, and (2) to devise new approaches to reduce communication overheads and their impact on scalability in the light of modern task-based runtime systems.
The first study analyzes the impact of merging phases on the scalability of data-mining applications. The merging phase assembles partial results from multiple threads and has an inherently serial component that grows with the number of cores. The results establish that the scalability of such applications is much lower than what Amdahl’s law predicts. The study also shows that such applications favor designs with fewer large cores over designs with several small cores.
The second study proposes architectural support for data forwarding to mitigate communication overheads associated with producer-consumer sharing. Existing forwarding approaches, which proactively forward data from the producer to the consumer, are shown to have limited applicability for task-based applications. An alternative technique is proposed in which producers track the identity of updated blocks and initiate forwarding after receiving an initial request from the consumer. The technique leverages the observed spatial locality in producer-consumer data to simplify the hardware changes needed for tracking updates. The proposed forwarding scheme is shown to mitigate communication overheads due to producer-consumer sharing.
Finally, the third study investigates the potential of using inter-task dependency and mapping information available to the runtime system to facilitate coherence optimizations. This study shows that conveying this information to the underlying cache coherence substrate can trigger coherence optimizations, which in turn can significantly reduce the communication overheads associated with prominent sharing patterns.
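
The first study's point about merging phases can be illustrated with a small numerical sketch. The code below compares classic Amdahl's law, where the serial fraction is fixed, with a hypothetical model in which the merging phase adds serial time that grows linearly with the number of cores. The linear growth, the function names, and the parameter values are assumptions chosen for illustration only; they are not the model or the measurements used in the thesis.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Classic Amdahl's law: the serial fraction does not depend on core count."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def speedup_with_merge(serial_fraction: float,
                       merge_cost_per_core: float,
                       cores: int) -> float:
    """Hypothetical model (an assumption, not taken from the thesis).

    Times are normalized so that single-core execution takes 1.0.
    The merging phase is assumed to add serial time proportional to the
    number of additional cores whose partial results must be combined.
    """
    parallel_time = (1.0 - serial_fraction) / cores
    merge_time = merge_cost_per_core * (cores - 1)  # no merge on a single core
    return 1.0 / (serial_fraction + parallel_time + merge_time)


if __name__ == "__main__":
    # Illustrative parameters: 1% fixed serial work, 0.1% merge cost per extra core.
    for n in (1, 4, 16, 32, 64):
        print(n,
              round(amdahl_speedup(0.01, n), 2),
              round(speedup_with_merge(0.01, 0.001, n), 2))
```

Under these assumed parameters the merge-aware speedup falls below the Amdahl prediction as cores are added and eventually starts to decrease. This is consistent in spirit with the abstract's claim that such applications scale worse than Amdahl's law predicts, although the actual growth behaviour has to be taken from the thesis itself.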

task parallelism


cache coherence

runtime systems

sharing patterns

Amdahl’s Law

Room EB, EDIT Building, Rännvägen 6
Opponent: Dr. Pedro Trancoso, University of Cyprus


Madhavan Manivannan

Chalmers, Computer Science and Engineering (Chalmers), Computer Engineering (Chalmers)

Implications of Merging Phases on Scalability of Multi-core Architectures

Proceedings of the International Conference on Parallel Processing. 40th International Conference on Parallel Processing, ICPP 2011, Taipei City, 13-16 September 2011 (2011), p. 622-631

Paper in proceeding

Subject Categories

Computer Engineering

Areas of Advance

Information and Communication Technology

