A Cache-centric Execution Model and Runtime for Deep Parallel Multicore Topologies
Paper in proceeding, 2016
Computational task DAGs are executed on parallel computers by a task scheduling algorithm. Intelligent scheduling is critical for achieving high parallelism, low overheads and reduced communication. A key technique for load balancing task DAGs is work stealing (WS), which Blumofe et al. popularized for fork-join computations . In scenarios of high parallel slackness, WS's distributed nature allows it to scale to a large number of cores with low overhead . However, the space of a WS computation grows proportionally to the number of cores. Targeting a lower bound, Blelloch et al. proposed the parallel-depth-first (PDF) scheduler . PDF schedules tasks by following the depth-first (serial) order of computation and has space requirements closer to the serial execution. PDF has been shown to provide constructive cache sharing in modern multicore architectures . However, implementing PDF requires a centralized scheduler which limits scalability. Targeting NUMA architectures, Olivier et al. proposed to load balance multiple PDF schedulers via WS . While enabling scalability to larger systems, such approach still suffers from centralized scheduling of fine-grained parallelism . Furthermore, for applications in which the amount of parallelism varies greatly, a fixed hierarchy of PDF queues is not enough.
constructive cache sharing