Towards Runtime-Assisted Cache Management for Task-Parallel Programs
Doctoral thesis, 2016
Architects have adopted the shared memory model, which implicitly manages cache coherence and cache capacity in hardware, mainly to aid the programmability of multi-core architectures. These hardware mechanisms are, however, prone to inefficiencies because they are not tailored to the behavior of individual parallel applications. Specifically, the manner in which sharing patterns are handled by the coherence protocol (e.g., MESI) leads to coherence inefficiencies, and the manner in which data access patterns are handled by the replacement policy (e.g., LRU) leads to capacity inefficiencies (due to dead blocks). The underlying strategy adopted by hardware-based proposals to address these inefficiencies is simple: glean information about sharing and access patterns by analyzing accesses to cache blocks, and enable optimizations when conditions that lead to inefficiencies are detected.
This thesis proposes new approaches that leverage the rich information about parallel applications that is available to the runtime system in task-based and task data-flow programming models: tasks and their working sets, inter-task dependencies, and the mapping of tasks to cores. This information is used to effectively address the inefficiencies introduced by hardware management of cache coherence and cache capacity. The thesis also establishes the utility of hardware-based proposals that address the coherence and cache capacity inefficiencies in the context of such parallel applications.
This thesis makes the following contributions in addressing cache coherence and capacity inefficiencies. To mitigate coherence overheads in producer-consumer sharing, the thesis first proposes a hardware forwarding scheme that tracks updates at the producer with low overhead and initiates forwarding after consumers issue the first request to access the updated data. A hybrid technique is then proposed that detects producer-consumer and migratory sharing patterns in the runtime system. This information is communicated to the cache coherence substrate to trigger the appropriate coherence optimizations and so mitigate the coherence overheads due to these sharing patterns. As for cache capacity management, this thesis focuses on managing dead blocks, which have been shown to lead to inefficient utilization of cache capacity. It proposes a novel technique that exploits information exchange between the runtime system and the architecture to detect dead blocks in the last-level cache more efficiently and signal them for eviction. Finally, this thesis leverages the outlook for future accesses to data provided by the runtime system to identify blocks that are dead in the entire hierarchy and evict them simultaneously. This approach to global management of dead blocks is shown to be beneficial over the local approaches adopted by hardware-based proposals, which predict dead blocks at each level individually.
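The idea of using the runtime system's outlook on future accesses to find dead data can be illustrated with a minimal sketch. This is not the thesis's implementation: the `Task` class, the `dead_regions_after` function, and the use of named data regions in place of cache blocks are all assumptions made for illustration. The sketch assumes that each task declares the regions it reads and writes, as in a task data-flow model, and it conservatively treats a region as dead only when no pending task declares any access to it.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task descriptor with declared data-flow annotations."""
    name: str
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)


def dead_regions_after(finished, pending):
    """Return regions touched by `finished` that no pending task accesses.

    Conservative rule (an assumption of this sketch): a region counts as
    dead only if no pending task lists it as an input or an output.
    """
    touched = finished.reads | finished.writes
    future_accesses = set()
    for task in pending:
        future_accesses |= task.reads | task.writes
    return touched - future_accesses


# A three-task chain: a produces x, b turns x into y, c consumes y.
a = Task("a", writes={"x"})
b = Task("b", reads={"x"}, writes={"y"})
c = Task("c", reads={"y"})

# Once b has finished, only c is pending, so x has no future accesses.
dead_after_b = dead_regions_after(b, [c])
```

In a real system, the runtime would translate such regions into address ranges and communicate them to the cache hierarchy so that the corresponding blocks can be signalled for eviction; that interface is outside the scope of this sketch.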
task parallelism
cache hierarchy
runtime system
dead blocks
multi-core architecture
sharing patterns
EC, EDIT Building, Chalmers University of Technology
Opponent: Prof. Lawrence Rauchwerger, Department of Computer Science and Engineering, Texas A&M, USA