Accelerating CNN inference on long vector architectures via co-design
Paper in proceedings, 2023
CPU-based inference can be deployed as an alternative to off-chip accelerators. In this context, emerging vector architectures are a promising option, owing to their high efficiency. Yet the large design space of convolutional algorithms and hardware implementations makes the selection of design options challenging. In this paper, we present our ongoing research into co-designing future vector architectures for CPU-based Convolutional Neural Network (CNN) inference, focusing on the im2col+GEMM and Winograd kernels. Using the gem5 simulator, we explore the impact of several hardware microarchitectural features, including (i) vector lanes, (ii) vector lengths, (iii) cache sizes, and (iv) options for integrating the vector unit into the CPU pipeline. In the context of im2col+GEMM, we study the impact on the RISC-V Vector Extension and ARM-SVE ISAs of several BLIS-like algorithmic optimizations: (1) utilization of vector registers, (2) loop unrolling, (3) loop reordering, (4) manual vectorization, (5) prefetching, and (6) packing of matrices. We use the YOLOv3 and VGG16 network models for our evaluation. Our co-design study shows that BLIS-like optimizations are not beneficial to all types of vector microarchitectures. We additionally demonstrate that longer vector lengths (of at least 8192 bits) and larger caches (of 256MB) can boost the performance of our optimized CNN kernels by 5× compared to a 512-bit vector length and 1MB of L2 cache. In the context of Winograd, we present a novel inter-tile parallelization approach across the input/output channels, using 8×8 tiles per channel to vectorize the algorithm on vector-length agnostic (VLA) architectures. Our method exploits longer vector lengths and offers high memory reuse, resulting in a performance improvement of up to 2.4× for non-strided convolutional layers with a 3×3 kernel size, compared to our optimized im2col+GEMM approach on the Fujitsu A64FX processor. Our co-design study furthermore reveals that Winograd requires smaller cache sizes (up to 64MB) than im2col+GEMM.
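For readers unfamiliar with the im2col+GEMM formulation, the sketch below shows the underlying data rearrangement in plain C: each K×K receptive field of a CHW input is copied into a column of a matrix, so that the convolution reduces to a single matrix multiply against the reshaped weights. This is a minimal illustration under assumed conditions (single image, no padding, illustrative names), not the paper's optimized kernel.

```c
/* Minimal im2col sketch (assumptions: CHW layout, no padding, one image).
 * After this rearrangement, the convolution becomes one GEMM:
 *   out[M][oh*ow] = weights[M][C*K*K] x col[C*K*K][oh*ow]. */
static void im2col(const float *in, int C, int H, int W,
                   int K, int stride, float *col)
{
    int oh = (H - K) / stride + 1;      /* output height */
    int ow = (W - K) / stride + 1;      /* output width  */
    for (int c = 0; c < C; c++)
        for (int ky = 0; ky < K; ky++)
            for (int kx = 0; kx < K; kx++) {
                int row = (c * K + ky) * K + kx;   /* row in col matrix */
                for (int y = 0; y < oh; y++)
                    for (int x = 0; x < ow; x++)
                        col[row * (oh * ow) + y * ow + x] =
                            in[c * H * W + (y * stride + ky) * W
                               + (x * stride + kx)];
            }
}
```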
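The BLIS-like optimizations listed above revolve around a small register-resident micro-kernel that consumes packed panels of the operands. The following plain-C sketch illustrates that structure only; MR, NR, and the packing layout are illustrative choices, and in the paper's setting the inner loop over NR would be mapped to RVV or ARM-SVE vector registers rather than left as scalar code.

```c
/* Sketch of a BLIS-style micro-kernel: accumulates an MR x NR block of C
 * from packed panels Apack (k x MR, column-major) and Bpack (k x NR,
 * row-major). MR/NR are illustrative; a real kernel keeps acc[][] in
 * vector registers and unrolls the p loop. */
#define MR 4
#define NR 8

static void microkernel(int k, const float *Apack, const float *Bpack,
                        float *C, int ldc)
{
    float acc[MR][NR] = {{0}};          /* register-resident accumulators */
    for (int p = 0; p < k; p++)         /* rank-1 updates over the k dim  */
        for (int i = 0; i < MR; i++)
            for (int j = 0; j < NR; j++)    /* vectorizable dimension */
                acc[i][j] += Apack[p * MR + i] * Bpack[p * NR + j];
    for (int i = 0; i < MR; i++)
        for (int j = 0; j < NR; j++)
            C[i * ldc + j] += acc[i][j];
}
```

Packing keeps the panels contiguous and cache-resident, which is why the abstract treats matrix packing and prefetching as first-class optimizations alongside vectorization.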
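For the Winograd kernel, one plausible reading of the inter-tile scheme is that the element-wise multiply stage is batched so that the vectorized dimension runs across tiles, letting it scale with the hardware vector length on VLA ISAs. The sketch below illustrates that idea for 8×8 transformed tiles; the data layout, function name, and single-output-channel framing are our assumptions for illustration, not the authors' exact implementation.

```c
#include <stddef.h>

/* Sketch of the Winograd element-wise stage with inter-tile batching
 * (assumptions: 8x8 transformed tiles; data pre-laid-out so the T tiles
 * of one (position, channel) pair are contiguous; one output channel;
 * M must be zero-initialized by the caller). The inner loop over tiles
 * is the one a VLA compiler can vectorize to the full register width. */
enum { TILE = 8 * 8 };  /* 64 elements per transformed tile */

static void winograd_pointwise(int T, int Cin,
                               const float *U,  /* [TILE][Cin]    weights */
                               const float *V,  /* [TILE][Cin][T] inputs  */
                               float *M)        /* [TILE][T]      outputs */
{
    for (int e = 0; e < TILE; e++)          /* 64 tile positions */
        for (int c = 0; c < Cin; c++) {
            float u = U[e * Cin + c];
            const float *v = V + ((size_t)(e * Cin + c)) * T;
            float *m = M + (size_t)e * T;
            for (int t = 0; t < T; t++)     /* inter-tile: VLA-friendly */
                m[t] += u * v[t];
        }
}
```

Because the inner trip count is the number of tiles rather than a fixed tile width, this formulation can fill 8192-bit registers without code changes, which is consistent with the abstract's observation that the method exploits longer vector lengths.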
Keywords: vector-length agnostic ISAs, long vector architectures, Winograd, GEMM, CNNs, co-design, optimizations