Co-Design of Convolutional Algorithms and Long Vector RISC-V Processors for Efficient CNN Model Serving
Paper in proceedings, 2024
The performance of a convolutional algorithm depends on the kernel size, stride, and number of input/output channels of the convolutional layer. Moreover, the varying computational demands of convolutional layers influence the requirements for SIMD support on multicore processors. Finally, the sharing of cache resources in scenarios such as inference serving also affects the runtime choice of the best algorithm. To identify the best settings, we perform a co-design exploration of the software parameters of the convolutional layers of convolutional neural networks (CNNs) and of three distinct algorithmic implementations, namely Direct, im2col+GEMM, and Winograd, jointly with the hardware parameters of vector architectures. Our simulation-based study identifies Winograd as the best choice for convolutional layers with a 3 × 3 kernel and stride 1, particularly for shorter vector lengths and smaller L2 cache sizes, while im2col+GEMM performs better for layers with more input/output channels. Looking at VGG-16, our study shows that not all layers benefit from our largest simulated cache when using the Direct and Winograd implementations, whereas the im2col+GEMM implementation scales up to a 64 MB L2 cache for all layers. In contrast, all simulated layers of YOLOv3 benefit from a 64 MB L2 cache with all convolutional algorithms. To select the best implementation at runtime, we develop a random forest predictor that chooses the best algorithm in over 90% of the cases, with limited performance degradation when a suboptimal configuration is selected. We conclude with a Pareto analysis of the area-performance trade-off in an inference serving scenario, on a 7 nm RISC-V multicore model with a vector unit supporting vector lengths from 512 to 4096 bits.
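As background for the algorithmic comparison above, the following sketch contrasts a Direct convolution with the im2col+GEMM formulation, which lowers convolution to a single matrix multiplication that vector units and cache-blocked GEMM kernels handle well. This is a minimal NumPy illustration of the two schemes, not the paper's optimized RISC-V implementation; the tensor layouts, function names, and the absence of padding are simplifying assumptions of this sketch.

    import numpy as np

    def conv_direct(x, weights, stride=1):
        # Direct convolution: explicit loops over filters and output pixels.
        # x: (C, H, W) input; weights: (K, C, kh, kw) filters; no padding.
        k, c, kh, kw = weights.shape
        _, h, w = x.shape
        out_h = (h - kh) // stride + 1
        out_w = (w - kw) // stride + 1
        out = np.zeros((k, out_h, out_w), dtype=x.dtype)
        for f in range(k):
            for i in range(out_h):
                for j in range(out_w):
                    patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
                    out[f, i, j] = np.sum(weights[f] * patch)
        return out

    def conv_im2col_gemm(x, weights, stride=1):
        # im2col+GEMM: unfold input patches into columns, then one big GEMM.
        k, c, kh, kw = weights.shape
        _, h, w = x.shape
        out_h = (h - kh) // stride + 1
        out_w = (w - kw) // stride + 1
        cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
        for i in range(out_h):
            for j in range(out_w):
                patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
                cols[:, i * out_w + j] = patch.ravel()
        # (K, C*kh*kw) @ (C*kh*kw, out_h*out_w): the GEMM that dominates runtime.
        out = weights.reshape(k, -1) @ cols
        return out.reshape(k, out_h, out_w)

    # The two formulations compute the same result.
    x = np.random.rand(3, 8, 8).astype(np.float32)
    w = np.random.rand(4, 3, 3, 3).astype(np.float32)
    assert np.allclose(conv_direct(x, w), conv_im2col_gemm(x, w), atol=1e-5)

The runtime algorithm selection can likewise be pictured as a standard classification task. The sketch below, using scikit-learn's RandomForestClassifier, trains on layer and hardware descriptors labeled with the fastest algorithm; the feature set and the toy training rows are illustrative assumptions of this sketch, not the paper's dataset.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    ALGOS = ["Direct", "im2col+GEMM", "Winograd"]

    # Hypothetical features per sample:
    # [kernel_size, stride, in_channels, out_channels, vector_bits, l2_cache_mb]
    # Labels (indices into ALGOS) are placeholders standing in for offline timings.
    X_train = np.array([
        [3, 1,  64,  64,  512,  1],   # 3x3, stride 1, short vectors: Winograd-friendly
        [3, 1, 512, 512, 4096, 64],   # many channels: im2col+GEMM-friendly
        [1, 1, 256, 256, 2048, 16],
        [7, 2,   3,  64, 1024,  4],
    ])
    y_train = np.array([2, 1, 1, 0])

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)

    layer = np.array([[3, 1, 128, 128, 1024, 8]])
    print("Predicted best algorithm:", ALGOS[int(clf.predict(layer)[0])])

In deployment, such a predictor would be trained on measured per-layer timings across hardware configurations and queried once per layer when the model is loaded.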