Automated CNN pipeline generation for heterogeneous architectures
Licentiate thesis, 2022
Asymmetric multicore clusters, such as a high-performance cluster paired with a power-efficient cluster, are common in modern edge devices. One example is Intel's Alder Lake, which features Golden Cove high-performance cores and Gracemont power-efficient cores. Chiplet-based technology allows cores to be organized as multi-chip modules, so that a single processor can house a large number of cores.
Interposer-based packaging has enabled embedding High Bandwidth Memory (HBM) on chip and has reduced the transmission latency and energy consumption of the chiplet-to-chiplet interconnect. For instance, Intel's XeHPC Ponte Vecchio package integrates a multi-chip GPU organization along with HBM modules.
Since new devices feature heterogeneity at the level of cores, memory and on-chip interconnect, it has become important to steer optimization at the application level in order to leverage the heterogeneous, high-performing and power-efficient features of the underlying computing platforms.
An important high-performance application paradigm is the Convolutional Neural Network (CNN). CNNs are widely used in practical applications, and a pipelined parallel implementation of a CNN is favored for inference on edge devices.
In this Licentiate thesis we present a novel scheme for automatic scheduling of CNN pipelines on heterogeneous devices. A pipeline schedule is a configuration that specifies the depth of the pipeline, the grouping of CNN layers into pipeline stages, and the mapping of pipeline stages onto computing units. We utilize simple compile-time hints, which consist of workload information for the individual CNN layers and performance hints for the computing units.
The proposed approach provides a near-optimal solution for a throughput-maximizing pipeline.
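To make the notion of a pipeline schedule concrete, the following is a minimal sketch and not the thesis implementation. It assumes that stages consist of consecutive layers, that a layer's workload hint and a unit's performance hint are plain numbers, and that a stage's time is its total workload divided by the speed of its unit. The names `PipelineSchedule`, `stage_times` and `estimated_throughput` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PipelineSchedule:
    # stage_sizes[i]: number of consecutive CNN layers grouped into stage i
    stage_sizes: List[int]
    # stage_units[i]: index of the computing unit that runs stage i
    stage_units: List[int]

    @property
    def depth(self) -> int:
        # Pipeline depth is simply the number of stages.
        return len(self.stage_sizes)


def stage_times(schedule: PipelineSchedule,
                layer_work: Sequence[float],
                unit_speed: Sequence[float]) -> List[float]:
    """Estimate per-stage time as (sum of layer workloads) / (unit speed).

    This cost model is an assumption made for illustration only.
    """
    times = []
    first = 0
    for size, unit in zip(schedule.stage_sizes, schedule.stage_units):
        work = sum(layer_work[first:first + size])
        times.append(work / unit_speed[unit])
        first += size
    return times


def estimated_throughput(schedule: PipelineSchedule,
                         layer_work: Sequence[float],
                         unit_speed: Sequence[float]) -> float:
    # A pipeline is limited by its slowest stage.
    return 1.0 / max(stage_times(schedule, layer_work, unit_speed))


# Hypothetical hints: four layers, one fast unit (speed 2.0), one slow unit (1.0).
layer_work = [4.0, 2.0, 6.0, 8.0]
unit_speed = [2.0, 1.0]
schedule = PipelineSchedule(stage_sizes=[3, 1], stage_units=[0, 1])

print(schedule.depth)                                        # 2
print(stage_times(schedule, layer_work, unit_speed))         # [6.0, 8.0]
print(estimated_throughput(schedule, layer_work, unit_speed))  # 0.125
```

Under this simplified model, comparing two schedules only requires comparing their slowest stages, which is why workload and performance hints can already rank candidate configurations before anything is executed.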
We model the problem as a design space exploration problem. We developed a time-efficient method for navigating the design space, using heuristics extracted from knowledge of the CNN structure and the underlying computing platform. The proposed search scheme converges faster and uses real-time performance measurements as fitness values.
The results demonstrate that the proposed scheme converges faster and can scale when used with larger networks and computing platforms.
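As an illustration of measurement-guided navigation, the sketch below uses a generic greedy local search; it is not the search scheme proposed in the thesis. It assumes a fixed stage-to-unit mapping and a caller-supplied `measure` callback that returns per-stage times for a given grouping. Here the callback is a synthetic stand-in for real online measurements, and all names are hypothetical.

```python
from typing import Callable, List, Tuple


def balance_pipeline(num_layers: int,
                     num_stages: int,
                     measure: Callable[[List[int]], List[float]],
                     max_steps: int = 100) -> Tuple[List[int], float]:
    """Greedy search: shift one boundary layer out of the slowest stage
    whenever that lowers the measured bottleneck time."""
    # Start from an even split of consecutive layers over the stages.
    base, extra = divmod(num_layers, num_stages)
    sizes = [base + (1 if i < extra else 0) for i in range(num_stages)]
    times = measure(sizes)

    for _ in range(max_steps):
        slow = max(range(num_stages), key=lambda i: times[i])
        best_sizes, best_times = None, times
        # Try handing one layer to the previous or the next stage.
        for neighbour in (slow - 1, slow + 1):
            if 0 <= neighbour < num_stages and sizes[slow] > 1:
                candidate = list(sizes)
                candidate[slow] -= 1
                candidate[neighbour] += 1
                candidate_times = measure(candidate)
                if max(candidate_times) < max(best_times):
                    best_sizes, best_times = candidate, candidate_times
        if best_sizes is None:
            break  # no neighbouring configuration improves the bottleneck
        sizes, times = best_sizes, best_times
    return sizes, max(times)


# Synthetic stand-in for online measurements (assumed values).
layer_work = [4.0, 2.0, 6.0, 8.0, 1.0, 3.0]
unit_speed = [2.0, 1.0, 1.0]  # stage i runs on unit i


def synthetic_measure(sizes: List[int]) -> List[float]:
    times, first = [], 0
    for size, speed in zip(sizes, unit_speed):
        times.append(sum(layer_work[first:first + size]) / speed)
        first += size
    return times


sizes, bottleneck = balance_pipeline(6, 3, synthetic_measure)
print(sizes, bottleneck)  # [3, 1, 2] 8.0
```

In this toy run only three configurations are measured before the search stops. A local search of this kind can get stuck in a local optimum, so it should be read only as an example of using measurements as fitness values.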
Since the scheme utilizes online performance measurements, one of the challenges is to avoid expensive configurations during online tuning. The results demonstrate that on average, ~80% of the tested configurations are sub-optimal solutions.
Another challenge is to reduce convergence time. The experiments show that the proposed approach is 35x faster than stochastic optimization algorithms. Since the design space is large and complex, we also show that the proposed scheme explores only ~0.1% of the total design space for large CNNs (with 50+ layers) and still arrives at a near-optimal solution.
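To give a feel for how quickly such a design space grows, the following sketch counts schedules under simplifying assumptions that are not taken from the thesis: stages consist of consecutive layers, each stage runs on a distinct computing unit, and all units are distinguishable. With N layers and U units, a pipeline of depth d then has C(N-1, d-1) groupings and U!/(U-d)! mappings. The function name `count_schedules` is hypothetical.

```python
from math import comb, perm


def count_schedules(num_layers: int, num_units: int) -> int:
    """Count (grouping, mapping) pairs over all feasible pipeline depths."""
    total = 0
    for depth in range(1, min(num_layers, num_units) + 1):
        groupings = comb(num_layers - 1, depth - 1)  # choose stage boundaries
        mappings = perm(num_units, depth)            # assign distinct units
        total += groupings * mappings
    return total


# Small case that can be checked by hand: 4 layers, 2 units.
# depth 1: 1 grouping * 2 mappings = 2
# depth 2: 3 groupings * 2 mappings = 6
print(count_schedules(4, 2))  # 8

# A larger, hypothetical case: 50 layers on 8 units.
print(count_schedules(50, 8))
```

Under these assumptions the count grows combinatorially with the number of layers and units, which is why exhaustive online evaluation quickly becomes impractical.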
Online tuning
CNN parallel pipelines
Heterogeneous computing units
Processing on chiplets
Design space exploration
Author
Pirah Noor Soomro
Chalmers, Computer Science and Engineering, Computer Engineering
An online guided tuning approach to run CNN pipelines on edge devices
Proceedings of the 18th ACM International Conference on Computing Frontiers 2021, CF 2021 (2021), p. 45-53
Paper in proceeding
Shisha: Online Scheduling of CNN Pipelines on Heterogeneous Architectures
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 13826 LNCS (2023), p. 249-262
Paper in proceeding
Scheduling Task-parallel Applications in Dynamically Asymmetric Environments
ACM International Conference Proceeding Series (2020)
Paper in proceeding
Low-energy toolset for heterogeneous computing (LEGaTO)
European Commission (EC) (EC/H2020/780681), 2018-02-01 -- 2021-01-31.
Principles of Computational Memory Devices (PRIDE)
Swedish Foundation for Strategic Research (SSF) (DnrCHI19-0048), 2021-01-01 -- 2025-12-31.
Subject Categories
Computer Engineering
Computer Science
Computer Systems
Infrastructure
C3SE (Chalmers Centre for Computational Science and Engineering)
Publisher
Chalmers
Room 8103, EDIT Rännvägen 6B, Chalmers University of Technology, Campus Johanneberg
Opponent: Eduardo Quiñones, Department of Computer Science at the Barcelona Supercomputing Center (BSC), Spain