Accelerating CNN inference via co-design of convolutional algorithms and long vector processors
Licentiate thesis, 2025
Model serving has become crucial for AI applications, with convolutional neural networks (CNNs) driving workloads ranging from object detection to speech recognition. While specialized accelerators and GPUs offer high performance for CNN inference, CPU-based solutions provide better availability and portability for server-side and mobile computing. Vector architectures such as the RISC-V Vector extension (RISC-VV) and the ARM Scalable Vector Extension (SVE) have emerged as a promising alternative, offering GPU-like parallel processing with low latency, high availability, and lower energy consumption. This thesis investigates co-design opportunities in vector architectures for CNN inference, focusing on the interplay between convolutional algorithm optimizations and hardware design choices.

First, the thesis conducts a co-design study that explores convolutional algorithm optimizations together with hardware parameter tuning, such as vector length, cache size, and the number of vector lanes, for CNN inference on ARM SVE and RISC-VV. Second, it studies the co-design of CNN layers using three distinct algorithmic implementations, Direct, im2col+GEMM, and Winograd, in conjunction with hardware parameters for RISC-VV. In optimizing im2col+GEMM, several optimizations are applied to the GEMM kernel; however, our study shows that not all of them benefit the two vector architectures equally. Our co-design study using the gem5 simulator demonstrates an ∼5× performance improvement with 16384-bit vector lengths and 256 MB of L2 cache, compared to 512-bit vectors and 1 MB of L2 cache. Since larger tile sizes cannot be used for the Winograd algorithm due to numerical inaccuracies, this thesis proposes inter-tile parallelism across the input/output channels, using 8×8 tiles per channel, to utilize longer vector lengths. This approach improves data reuse and achieves an additional 2.4× performance improvement over im2col+GEMM on the A64FX processor. Our co-design study also shows that the Winograd algorithm has lower cache size requirements than im2col+GEMM.

The performance of convolutional algorithms depends on layer dimensions (input/output/kernel sizes, stride, and input/output channels), while computational demands influence SIMD requirements and cache sharing affects runtime algorithm selection in model serving. Our study shows that Winograd performs better with smaller vector lengths, whereas the Direct algorithm excels with longer vectors. While im2col+GEMM benefits from larger caches, Direct and Winograd exhibit varying cache sensitivity across VGG16 layers; in contrast, all YOLOv3 layers benefit from the largest simulated L2 cache across all algorithms. To address these complexities, this thesis proposes a random forest classifier that selects the optimal algorithm per layer with 92.8% accuracy; this per-layer selection improves performance by ∼2× compared to using a single algorithm. Finally, our Pareto analysis of area-performance trade-offs for a 7 nm RISC-V multicore model shows that algorithm selection increases throughput per area, highlighting the need for co-design in the context of model serving.
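For readers unfamiliar with the im2col+GEMM formulation discussed above, the following is a minimal NumPy sketch of that formulation, not the thesis's optimized vector kernels; the function names, the (C, H, W) layout, and the unit dilation are illustrative assumptions.

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Unfold a (C_in, H, W) input into a (C_in*kh*kw, H_out*W_out) matrix."""
    c, h, w = x.shape
    h_out = (h - kh) // stride + 1
    w_out = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, h_out * w_out), dtype=x.dtype)
    row = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                patch = x[ci,
                          i:i + stride * h_out:stride,
                          j:j + stride * w_out:stride]
                cols[row] = patch.reshape(-1)
                row += 1
    return cols, h_out, w_out

def conv2d_im2col_gemm(x, weights, stride=1):
    """Convolution expressed as one GEMM:
    (C_out, C_in*kh*kw) @ (C_in*kh*kw, H_out*W_out)."""
    c_out, c_in, kh, kw = weights.shape
    cols, h_out, w_out = im2col(x, kh, kw, stride)
    gemm_out = weights.reshape(c_out, -1) @ cols  # the GEMM kernel that gets vectorized
    return gemm_out.reshape(c_out, h_out, w_out)

x = np.random.rand(3, 8, 8).astype(np.float32)   # toy input
w = np.random.rand(4, 3, 3, 3).astype(np.float32)  # 4 output channels, 3x3 kernels
print(conv2d_im2col_gemm(x, w).shape)  # (4, 6, 6)
```

The point of the transformation is that the entire layer collapses into a single large matrix multiply, which is the kernel that the GEMM optimizations and the vector-length/cache co-design in the thesis target.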
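The inter-tile parallelism result above concerns the Winograd algorithm with 8×8 tiles per channel. As a rough structural sketch only, and not the thesis's vector implementation, the standard batched view of Winograd F(6×6, 3×3) reduces the element-wise stage to 64 independent small GEMMs over channels and tiles, which is the dimension along which long vectors can be filled; c_out, c_in, and n_tiles are placeholder sizes and the input/filter transforms are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
c_out, c_in, n_tiles = 64, 32, 196        # placeholder layer sizes

# After the (omitted) input and filter transforms, each of the 64 positions
# of the 8x8 transformed tile yields an independent GEMM across channels/tiles.
U = rng.random((8, 8, c_out, c_in))       # transformed filters
V = rng.random((8, 8, c_in, n_tiles))     # transformed input tiles
M = np.einsum('xyoc,xyct->xyot', U, V)    # 8*8 GEMMs of shape (c_out, n_tiles)
print(M.shape)                            # (8, 8, 64, 196)
```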
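To illustrate the per-layer algorithm selection idea, here is a hedged scikit-learn sketch. The feature set, labels, and training data below are placeholders; the thesis's actual features, training corpus, and the reported 92.8% accuracy are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-layer descriptors; the real feature set may differ:
# [H, W, C_in, C_out, kernel, stride, vector_length_bits, l2_cache_kb]
ALGORITHMS = ["direct", "im2col_gemm", "winograd"]

rng = np.random.default_rng(0)
X_train = rng.random((200, 8))            # placeholder layer/hardware descriptors
y_train = rng.integers(0, 3, size=200)    # placeholder "best algorithm" labels

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# At serving time, pick a convolution algorithm per layer from its descriptor.
layer = np.array([[224, 224, 3, 64, 3, 1, 512, 1024]])
print(ALGORITHMS[int(clf.predict(layer)[0])])
```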
Keywords: vector length agnostic ISAs, Direct, model serving, CNNs, long vector architectures, Winograd, GEMM, optimizations, co-design