Techniques to Improve Management of On-Chip Memories of Deep Learning Accelerators
Licentiate thesis, 2025
The first part of this thesis focuses on deep learning accelerators that suffer from poor on-chip memory utilization, which degrades performance and energy efficiency. Techniques such as loop reordering and blocking are used to improve utilization, but existing frameworks can be inefficient, either because of the high computational complexity of searching the optimization space or because a compromised search space leads to suboptimal choices. This part presents DNNOPT, a hardware/software framework that optimally selects loop orders and blocking factors using two proposed strategies, Early Exit and Strided Search, to prune the search space, combined with simple analytical models of data reuse. DNNOPT reduces the search space by over two orders of magnitude and improves performance, energy efficiency, and time to solution by an average of 1.8×, 50%, and 226×, respectively, for CNN and Transformer workloads compared to current frameworks.
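The sketch below illustrates, in simplified form, the two pruning ideas named above: Strided Search samples blocking factors with a stride instead of enumerating every candidate, and Early Exit stops the search once a candidate tile no longer fits in on-chip memory. The function names, the cost model, and the candidate generation are illustrative assumptions only, not the actual DNNOPT implementation.

```python
# Hypothetical sketch of Strided Search + Early Exit over loop blocking factors.
# The analytical cost model below is a crude placeholder for data-reuse modeling.

def candidate_blockings(dim_size, stride=2):
    """Strided Search: sample blocking factors with a stride instead of
    exhaustively enumerating every candidate."""
    b = 1
    while b <= dim_size:
        yield b
        b *= stride  # stride > 1 skips most of the search space

def estimate_offchip_traffic(blocking, workload):
    """Placeholder reuse model: larger tiles reuse more data, but only
    while the tile still fits in on-chip memory."""
    tile_bytes = blocking * blocking * workload["bytes_per_element"]
    if tile_bytes > workload["onchip_bytes"]:
        return None                               # tile overflows on-chip memory
    return workload["total_bytes"] / blocking     # crude off-chip traffic proxy

def search_blocking(workload):
    best, best_cost = None, float("inf")
    for b in candidate_blockings(workload["dim"]):
        cost = estimate_offchip_traffic(b, workload)
        if cost is None:
            break                                 # Early Exit: larger tiles cannot fit either
        if cost < best_cost:
            best, best_cost = b, cost
    return best, best_cost

if __name__ == "__main__":
    wl = {"dim": 256, "bytes_per_element": 2,
          "onchip_bytes": 64 * 1024, "total_bytes": 16 * 1024 * 1024}
    print(search_blocking(wl))   # e.g. (128, 131072.0) under these assumptions
```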
The second part of this thesis focuses on accelerators that allow the simultaneous execution of multiple Deep Neural Network (DNN) workloads, improving performance by overlapping computation and memory accesses. For effective operation, on-chip memory must be large enough to accommodate the total memory footprint of all workloads. Batching enhances weight reuse and lowers off-chip access costs by letting DNN inferences of the same model share weights. However, traditional batching, which sets the batch size statically across all layers, can cause stalls when on-chip memory is insufficient. This part introduces BATCH-DNN, a dynamic method that adjusts the batch size of each layer based on the available on-chip memory. It employs two techniques: adaptive cascaded sub-batching and adaptive sub-batch merging. Offline profiling determines the memory footprint, while runtime adjustments set the maximum batch size per layer. BATCH-DNN can increase accelerator compute-fabric utilization by 40%, leading to throughput improvements of up to 27%, with an average improvement of 6% for batched multi-DNN workloads.
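The following sketch illustrates the general idea of dynamic per-layer batching described above: an offline-profiled per-sample footprint is used at run time to cap each layer's batch size to what fits in the on-chip memory left over by co-running workloads, and a requested batch is run as a cascade of sub-batches. The class, function, and field names are assumptions for illustration, not the actual BATCH-DNN interface.

```python
# Hypothetical sketch of per-layer batch-size capping and cascaded sub-batching.
from dataclasses import dataclass

@dataclass
class LayerProfile:
    name: str
    weight_bytes: int                   # shared by all samples in a batch
    activation_bytes_per_sample: int    # grows linearly with batch size

def max_batch_for_layer(layer: LayerProfile, free_onchip_bytes: int) -> int:
    """Largest batch whose weights plus per-sample activations fit on chip."""
    budget = free_onchip_bytes - layer.weight_bytes
    if budget <= 0:
        return 1                        # fall back to unbatched execution
    return max(1, budget // layer.activation_bytes_per_sample)

def split_into_sub_batches(requested_batch: int, layer: LayerProfile,
                           free_onchip_bytes: int):
    """Cascaded sub-batching: run the requested batch as a sequence of
    sub-batches, each sized to the memory available when the layer runs."""
    cap = max_batch_for_layer(layer, free_onchip_bytes)
    remaining = requested_batch
    while remaining > 0:
        sub = min(cap, remaining)
        yield sub
        remaining -= sub

if __name__ == "__main__":
    conv = LayerProfile("conv3", weight_bytes=256 * 1024,
                        activation_bytes_per_sample=96 * 1024)
    # Assume 512 KiB of on-chip memory currently free for this workload.
    print(list(split_into_sub_batches(8, conv, free_onchip_bytes=512 * 1024)))
    # -> [2, 2, 2, 2] under these assumed footprints
```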
Loop Reordering
Deep Learning Accelerator On-Chip Memory
Loop Blocking
Loop Optimizations
Dynamic and Adaptive Batching
Author
Piyumal Ranawaka
Chalmers, Computer Science and Engineering (Chalmers), Computer Engineering (Chalmers)
P. Ranawaka, P. Stenstrom, BATCH-DNN: Adaptive and Dynamic Batching for Multi-DNN Accelerators
DNNOPT: A Framework for Efficiently Selecting On-chip Memory Loop Optimizations of DNN Accelerators
Proceedings of the 21st ACM International Conference on Computing Frontiers, CF 2024 (2024), p. 126-137
Paper in proceeding
Areas of Advance
Information and Communication Technology
Subject Categories (SSIF 2025)
Computer Sciences
Computer Systems
Publisher
Chalmers
Chalmers EDIT Analysen
Opponent: Prof. Magnus Jahre, NTNU, Norway