Automated CNN pipeline generation for heterogeneous architectures
Licentiate thesis, 2022

Heterogeneity is a vital feature of emerging processor chip designs.
Asymmetric multicore clusters, such as a high-performance cluster paired with a power-efficient cluster, are common in modern edge devices. One example is Intel's Alder Lake, featuring Golden Cove high-performance cores and Gracemont power-efficient cores. Chiplet-based technology allows multiple cores to be organized in the form of multi-chip modules, thus housing a large number of cores in a processor.
Interposer-based packaging has enabled embedding High Bandwidth Memory (HBM) on chip and has reduced the transmission latency and energy consumption of the chiplet-to-chiplet interconnect. For instance, Intel's XeHPC Ponte Vecchio package integrates a multi-chip GPU organization along with HBM modules.
Since new devices feature heterogeneity at the level of cores, memory and on-chip interconnect, it has become important to steer optimization at the application level in order to leverage the heterogeneous, high-performing and power-efficient features of the underlying computing platforms.
An important high-performance application paradigm is the Convolutional Neural Network (CNN). CNNs are widely used in many practical applications. The pipelined parallel implementation of CNNs is favored for inference on edge devices.
In this Licentiate thesis, we present a novel scheme for automatic scheduling of CNN pipelines on heterogeneous devices. A pipeline schedule is a configuration that specifies the depth of the pipeline, the grouping of CNN layers into pipeline stages, and the mapping of pipeline stages onto computing units. We utilize simple compile-time hints, which consist of workload information for individual CNN layers and performance hints for computing units.
The proposed approach provides a near-optimal solution for a throughput-maximizing pipeline.
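The three parts of a pipeline schedule named above (depth, layer grouping, stage mapping) can be illustrated with a minimal Python sketch. The class name `PipelineSchedule`, the function `estimate_throughput`, the workload numbers and the unit names are all hypothetical and are not taken from the thesis. The throughput model (the slowest stage bounds the pipeline) is a common simplifying assumption, not the thesis's measured fitness.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PipelineSchedule:
    # stages[i] lists the consecutive CNN layer indices grouped into stage i
    stages: List[List[int]]
    # mapping[i] names the computing unit that executes stage i
    mapping: List[str]

    @property
    def depth(self) -> int:
        # Pipeline depth is simply the number of stages
        return len(self.stages)


def estimate_throughput(schedule: PipelineSchedule,
                        layer_work: List[float],
                        unit_speed: Dict[str, float]) -> float:
    """Estimate steady-state throughput from compile-time hints.

    Assumed model: stage time = summed layer workload / unit speed,
    and the slowest stage limits the whole pipeline.
    """
    stage_times = [
        sum(layer_work[layer] for layer in layers) / unit_speed[unit]
        for layers, unit in zip(schedule.stages, schedule.mapping)
    ]
    return 1.0 / max(stage_times)


# Hypothetical hints: relative workload per layer, relative speed per unit
layer_work = [4.0, 2.0, 6.0, 3.0, 1.0]
unit_speed = {"big": 2.0, "little": 1.0}

schedule = PipelineSchedule(stages=[[0, 1, 2], [3, 4]],
                            mapping=["big", "little"])
throughput = estimate_throughput(schedule, layer_work, unit_speed)
print(schedule.depth, throughput)
```

In this toy example the first stage takes 6 time units and the second takes 4, so the first stage is the bottleneck.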
We model the problem as a design space exploration problem. We develop a time-efficient design space navigation scheme using heuristics extracted from knowledge of the CNN structure and the underlying computing platform. The proposed search scheme uses real-time performance measurements as fitness values.
The results demonstrate that the proposed scheme converges faster than stochastic optimization algorithms and can scale when used with larger networks and computing platforms.
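To make the idea of measurement-guided navigation concrete, the following is a minimal hill-climbing sketch. It is not the algorithm of the thesis: the heuristic used here (shift one boundary layer out of the bottleneck stage) is an assumption chosen for illustration, and `simulated_measure` is a hypothetical stand-in for an online performance measurement on real hardware. A schedule is represented as a list of stage sizes over consecutive layers, with stage i assumed to run on unit i.

```python
from typing import Callable, List, Tuple

Measurement = Tuple[float, List[float]]  # (throughput, per-stage times)


def simulated_measure(sizes: List[int], layer_work: List[float],
                      unit_speed: List[float]) -> Measurement:
    """Stand-in for an online measurement (assumption, not real profiling)."""
    times, start = [], 0
    for size, speed in zip(sizes, unit_speed):
        times.append(sum(layer_work[start:start + size]) / speed)
        start += size
    return 1.0 / max(times), times


def tune(sizes: List[int], measure: Callable[[List[int]], Measurement],
         max_steps: int = 100) -> Tuple[List[int], float, int]:
    """Greedy search: relieve the bottleneck stage one layer at a time."""
    best = list(sizes)
    best_tp, times = measure(best)
    tested = 1
    for _ in range(max_steps):
        b = times.index(max(times))  # index of the bottleneck stage
        improved = False
        if best[b] > 1:  # a stage must keep at least one layer
            for n in (b - 1, b + 1):  # neighbouring stages
                if not 0 <= n < len(best):
                    continue
                cand = list(best)
                cand[b] -= 1  # move one boundary layer out of the bottleneck
                cand[n] += 1
                tp, t = measure(cand)
                tested += 1
                if tp > best_tp:
                    best, best_tp, times = cand, tp, t
                    improved = True
                    break
        if not improved:
            break  # no neighbouring move helps: local optimum
    return best, best_tp, tested


# Hypothetical workload and unit speeds
layer_work = [4.0, 2.0, 6.0, 3.0, 1.0]
unit_speed = [2.0, 1.0]
measure = lambda s: simulated_measure(s, layer_work, unit_speed)
best, best_tp, tested = tune([1, 4], measure)
print(best, best_tp, tested)
```

Because every candidate costs one measurement, a search of this kind only pays off if it tests few configurations, which is the motivation for heuristic guidance.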
Since the scheme utilizes online performance measurements, one of the challenges is to avoid expensive configurations during online tuning. The results demonstrate that, on average, ~80% of the tested configurations are sub-optimal solutions.
Another challenge is to reduce convergence time. The experiments show that the proposed approach is 35x faster than stochastic optimization algorithms. Although the design space is large and complex, the proposed scheme explores only ~0.1% of the total design space in the case of large CNNs (having 50+ layers) and still arrives at a near-optimal solution.
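A rough feel for why the design space is large can be obtained by counting schedules under simplifying assumptions. The sketch below assumes that stages consist of consecutive layers, that each stage runs on a distinct computing unit, and that the depth ranges from 1 to the number of units. These assumptions and the function name `design_space_size` are illustrative only; the thesis may define its design space differently.

```python
from math import comb, perm


def design_space_size(num_layers: int, num_units: int) -> int:
    """Count (grouping, mapping) pairs under the stated assumptions."""
    total = 0
    for depth in range(1, min(num_layers, num_units) + 1):
        # Ways to cut num_layers consecutive layers into `depth` non-empty stages
        groupings = comb(num_layers - 1, depth - 1)
        # Ways to assign `depth` stages to distinct units, order matters
        mappings = perm(num_units, depth)
        total += groupings * mappings
    return total


# Tiny case that can be checked by hand: 3 layers, 2 units
small = design_space_size(3, 2)
# Hypothetical larger case: a 50-layer CNN on 8 computing units
large = design_space_size(50, 8)
print(small, large)
```

For 3 layers and 2 units the count is 6: two single-stage schedules (one per unit) plus two possible cut points times two unit orderings.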

Heterogeneous computing units

Processing on chiplets

Online tuning

Design space exploration

CNN parallel pipelines

Room 8103, EDIT Rännvägen 6B, Chalmers University of Technology, Campus Johanneberg
Opponent: Eduardo Quiñones, Department of Computer Science at the Barcelona Supercomputing Center (BSC), Spain


Pirah Noor Soomro

Chalmers, Computer Science and Engineering, Computer Engineering, Computer Systems

An online guided tuning approach to run CNN pipelines on edge devices

Proceedings of the 18th ACM International Conference on Computing Frontiers 2021, CF 2021 (2021), p. 45-53

Paper in proceedings

Shisha: Online scheduling of CNN pipelines on heterogeneous architectures. Pirah Noor Soomro, Mustafa Abduljabbar, Jeronimo Castrillon, and Miquel Pericàs. In submission to Euro-Par 2022: Parallel Processing: 28th International Conference on Parallel and Distributed Computing.

Scheduling Task-parallel Applications in Dynamically Asymmetric Environments

ACM International Conference Proceeding Series (2020)

Paper in proceedings

Low-energy toolset for heterogeneous computing (LEGaTO)

European Commission (EU) (EC/H2020/780681), 2018-02-01 -- 2021-01-31.

Principles for computational memory devices (PRIDE)

Swedish Foundation for Strategic Research (SSF) (DnrCHI19-0048), 2021-01-01 -- 2025-12-31.



Computer Science



C3SE (Chalmers Centre for Computational Science and Engineering)


