A Cache-centric Execution Model and Runtime for Deep Parallel Multicore Topologies
Paper in proceeding, 2016

Computational task DAGs are executed on parallel computers by a task scheduling algorithm. Intelligent scheduling is critical for achieving high parallelism, low overhead and reduced communication. A key technique for load balancing task DAGs is work stealing (WS), which Blumofe et al. popularized for fork-join computations [2]. In scenarios of high parallel slackness, the distributed nature of WS allows it to scale to a large number of cores with low overhead [4]. However, the space required by a WS computation grows in proportion to the number of cores. Aiming for a lower space bound, Blelloch et al. proposed the parallel-depth-first (PDF) scheduler [1]. PDF schedules tasks by following the depth-first (serial) order of the computation, and its space requirements are closer to those of serial execution. PDF has been shown to provide constructive cache sharing on modern multicore architectures [3]. However, implementing PDF requires a centralized scheduler, which limits scalability. Targeting NUMA architectures, Olivier et al. proposed load balancing multiple PDF schedulers via WS [8]. While this enables scalability to larger systems, such an approach still suffers from centralized scheduling of fine-grained parallelism [9]. Furthermore, for applications in which the amount of parallelism varies greatly, a fixed hierarchy of PDF queues is not sufficient.
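The contrast between the two scheduling policies described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the runtime proposed in the paper: the task DAG is assumed to be a full binary fork tree, workers run in lock-step rounds, and every name (`subtree_size`, `children`, `run_work_stealing`, `run_pdf`) is hypothetical. In the WS variant each worker owns a deque, takes its newest task first and steals the oldest task from a random victim when idle. In the PDF variant a single central queue always dispatches the ready tasks that come earliest in serial depth-first order.

```python
import heapq
import random
from collections import deque


def subtree_size(depth):
    """Number of tasks in a full binary fork tree of the given depth."""
    return 2 ** (depth + 1) - 1


def children(task):
    """Spawn the children of a task, where a task is (serial_index, depth).

    serial_index is the task's position in the serial depth-first order.
    """
    index, depth = task
    if depth == 0:
        return []
    left = (index + 1, depth - 1)
    right = (index + 1 + subtree_size(depth - 1), depth - 1)
    return [left, right]


def run_work_stealing(depth, workers, seed=0):
    """Simulate WS in lock-step rounds; return (tasks executed, steals)."""
    rng = random.Random(seed)
    queues = [deque() for _ in range(workers)]
    queues[0].append((0, depth))
    executed = steals = 0
    while any(queues):
        for w in range(workers):
            if queues[w]:
                task = queues[w].pop()  # owner takes its newest task (LIFO)
            else:
                victim = rng.randrange(workers)
                if victim == w or not queues[victim]:
                    continue  # failed steal attempt, idle for this round
                task = queues[victim].popleft()  # thief takes oldest task
                steals += 1
            executed += 1
            # Push the right child first so the left child is popped next,
            # which matches the serial order on the owning worker.
            for child in reversed(children(task)):
                queues[w].append(child)
    return executed, steals


def run_pdf(depth, workers):
    """Simulate PDF with one central queue ordered by serial index.

    Returns (execution order as serial indices, peak ready-queue length).
    """
    ready = [(0, depth)]
    order = []
    max_ready = 1
    while ready:
        # Dispatch up to `workers` ready tasks, earliest serial index first.
        batch = [heapq.heappop(ready) for _ in range(min(workers, len(ready)))]
        for task in batch:
            order.append(task[0])
            for child in children(task):
                heapq.heappush(ready, child)
        max_ready = max(max_ready, len(ready))
    return order, max_ready


if __name__ == "__main__":
    print("WS  (tasks, steals):", run_work_stealing(depth=6, workers=4))
    order, peak = run_pdf(depth=6, workers=4)
    print("PDF (tasks, peak ready queue):", (len(order), peak))
```

With a single worker the PDF sketch reproduces the serial depth-first order exactly. The single shared `ready` heap also makes the scalability concern visible: every dispatch goes through one data structure, whereas the WS sketch touches another worker's deque only on a steal.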

task scheduling

constructive cache sharing

resource management

multicores

Author

Miquel Pericas

Chalmers, Computer Science and Engineering (Chalmers), Computer Engineering (Chalmers)

Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT

1089795X (ISSN)

Vol. 2016, pp. 429-431

25th International Conference on Parallel Architectures and Compilation Techniques, PACT 2016
Haifa, Israel

Subject Categories (SSIF 2011)

Control Engineering

DOI

10.1145/2967938.2974052

More information

Latest update

2/17/2021