Automated CNN pipeline generation for heterogeneous architectures
Licentiate thesis, 2022

Heterogeneity is a vital feature of emerging processor chip designs.
Asymmetric multicore clusters, which pair a high-performance cluster with a power-efficient cluster, are common in modern edge devices. One example is Intel's Alder Lake, featuring Golden Cove high-performance cores and Gracemont power-efficient cores. Chiplet-based technology allows multiple cores to be organized as multi-chip modules, thus housing a large number of cores in a single processor.
Interposer-based packaging has enabled embedding High Bandwidth Memory (HBM) on chip and has reduced the transmission latency and energy consumption of the chiplet-to-chiplet interconnect. For instance, Intel's XeHPC Ponte Vecchio package integrates a multi-chip GPU organization along with HBM modules.
Since new devices feature heterogeneity at the level of cores, memory and on-chip interconnect, it has become important to steer optimization at the application level in order to leverage the new heterogeneous, high-performing and power-efficient features of the underlying computing platforms.
An important high-performance application paradigm is Convolutional Neural Networks (CNNs). CNNs are widely used in many practical applications. The pipelined parallel implementation of CNNs is favored for inference on edge devices.
In this Licentiate thesis we present a novel scheme for automatic scheduling of CNN pipelines on heterogeneous devices. A pipeline schedule is a configuration that specifies the depth of the pipeline, the grouping of CNN layers into pipeline stages, and the mapping of pipeline stages onto computing units. We utilize simple compile-time hints, which consist of workload information for individual CNN layers and performance hints for the computing units.
The proposed approach provides a near-optimal solution for a throughput-maximizing pipeline.
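As an illustration of what such a pipeline schedule might look like in code, the following minimal sketch represents a schedule as a grouping of layers into contiguous stages plus a stage-to-unit mapping, and estimates throughput from the slowest stage. The names `PipelineSchedule`, `layer_work` and `unit_speed`, the numeric values, and the static cost model (stage time = summed workload / unit speed) are assumptions made for this example only; the thesis itself uses online performance measurements as fitness values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineSchedule:
    """Hypothetical schedule: layer grouping plus stage-to-unit mapping."""
    stages: tuple   # stages[i] = tuple of layer indices in pipeline stage i
    mapping: tuple  # mapping[i] = name of the computing unit running stage i

    @property
    def depth(self):
        # The pipeline depth is the number of stages.
        return len(self.stages)


def stage_times(schedule, layer_work, unit_speed):
    """Estimated time per stage: summed layer workload / speed of its unit."""
    return [
        sum(layer_work[layer] for layer in stage) / unit_speed[unit]
        for stage, unit in zip(schedule.stages, schedule.mapping)
    ]


def estimated_throughput(schedule, layer_work, unit_speed):
    """Steady-state pipeline throughput is limited by the slowest stage."""
    return 1.0 / max(stage_times(schedule, layer_work, unit_speed))


# Illustrative hints (made-up numbers): per-layer workload and relative
# speed of each computing unit.
layer_work = [4.0, 2.0, 6.0, 3.0]
unit_speed = {"big": 2.0, "little": 1.0}

schedule = PipelineSchedule(stages=((0, 1, 2), (3,)), mapping=("big", "little"))
times = stage_times(schedule, layer_work, unit_speed)
throughput = estimated_throughput(schedule, layer_work, unit_speed)
```

In this sketch the first stage is the bottleneck, so moving a layer to the second stage would be the natural next configuration to try.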
We model the problem as a design space exploration problem. We develop a time-efficient design space navigation scheme based on heuristics extracted from knowledge of the CNN structure and the underlying computing platform. The proposed search scheme converges faster and utilizes real-time performance measurements as fitness values.
The results demonstrate that the proposed scheme converges faster and can scale when used with larger networks and computing platforms.
Since the scheme utilizes online performance measurements, one of the challenges is to avoid expensive configurations during online tuning. The results demonstrate that on average, ~80% of the tested configurations are sub-optimal solutions.
Another challenge is to reduce convergence time. The experiments show that the proposed approach is 35x faster than stochastic optimization algorithms. Although the design space is large and complex, we show that the proposed scheme explores only ~0.1% of the total design space in the case of large CNNs (having 50+ layers) and still results in a near-optimal solution.
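To give a rough sense of why the design space grows so quickly, the sketch below counts candidate schedules under simplifying assumptions that are not taken from the thesis: layers are grouped into contiguous stages, each stage runs on a distinct computing unit, and the pipeline depth cannot exceed the number of units. The function name `count_schedules` is hypothetical.

```python
from math import comb, perm


def count_schedules(num_layers, num_units):
    """Count schedules under the assumptions stated above.

    For a given depth d, there are C(num_layers - 1, d - 1) ways to cut the
    layer sequence into d contiguous stages, and P(num_units, d) ways to
    assign the stages to distinct units.
    """
    total = 0
    for depth in range(1, min(num_layers, num_units) + 1):
        groupings = comb(num_layers - 1, depth - 1)
        mappings = perm(num_units, depth)
        total += groupings * mappings
    return total


# Small case that can be checked by hand: 3 layers, 2 units.
small = count_schedules(3, 2)

# A larger, illustrative case: 50 layers on 8 units.
large = count_schedules(50, 8)
```

Even under these restrictive assumptions the count grows combinatorially with the number of layers and units, which is why exhaustive online evaluation is impractical.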

Online tuning

CNN parallel pipelines

Heterogeneous computing units

Processing on chiplets

Design space exploration

Room 8103, EDIT Rännvägen 6B, Chalmers University of Technology, Campus Johanneberg
Opponent: Eduardo Quiñones, Department of Computer Science at the Barcelona Supercomputing Center (BSC), Spain

Author

Pirah Noor Soomro

Chalmers, Computer Science and Engineering (Chalmers), Computer Engineering (Chalmers)

An online guided tuning approach to run CNN pipelines on edge devices

Proceedings of the 18th ACM International Conference on Computing Frontiers 2021, CF 2021 (2021), p. 45-53

Paper in proceeding

Shisha: Online Scheduling of CNN Pipelines on Heterogeneous Architectures

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 13826 LNCS (2023), p. 249-262

Paper in proceeding

Scheduling Task-parallel Applications in Dynamically Asymmetric Environments

ACM International Conference Proceeding Series (2020)

Paper in proceeding

Low-energy toolset for heterogeneous computing (LEGaTO)

European Commission (EC) (EC/H2020/780681), 2018-02-01 -- 2021-01-31.

Principer för beräknande minnesenheter (PRIDE)

Swedish Foundation for Strategic Research (SSF) (DnrCHI19-0048), 2021-01-01 -- 2025-12-31.

Subject Categories

Computer Engineering

Computer Science

Computer Systems

Infrastructure

C3SE (Chalmers Centre for Computational Science and Engineering)

Publisher

Chalmers

Online

More information

Latest update

8/14/2023