Techniques to Improve Management of On-Chip Memories of Deep Learning Accelerators
Licentiate thesis, 2025
The first part of this thesis focuses on deep learning accelerators that suffer from poor on-chip memory utilization, which degrades performance and energy efficiency. Techniques such as loop reordering and blocking are used to improve utilization, but existing frameworks can be inefficient, either because of the high computational complexity of searching the optimization space or because a compromised search space leads to suboptimal choices. This part presents DNNOPT, a hardware/software framework that optimally selects loop orders and blocking factors using two proposed strategies, Early Exit and Strided Search, to prune the search space, combined with simple analytical models of data reuse. DNNOPT reduces the search space by over two orders of magnitude and improves performance, energy efficiency, and time to solution by an average of 1.8×, 50%, and 226×, respectively, for CNN and Transformer workloads compared to current frameworks.
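The sketch below illustrates, in simplified form, the two pruning ideas named above: Strided Search samples blocking factors with a stride instead of enumerating every candidate, and Early Exit stops the search once a candidate tile no longer fits in on-chip memory. The function names, the cost model, and the candidate generation are illustrative assumptions only, not the actual DNNOPT implementation.

```python
# Hypothetical sketch of Strided Search + Early Exit over loop blocking factors.
# The analytical cost model below is a crude placeholder for data-reuse modeling.

def candidate_blockings(dim_size, stride=2):
    """Strided Search: sample blocking factors with a stride instead of
    exhaustively enumerating every candidate."""
    b = 1
    while b <= dim_size:
        yield b
        b *= stride  # stride > 1 skips most of the search space

def estimate_offchip_traffic(blocking, workload):
    """Placeholder reuse model: larger tiles reuse more data, but only
    while the tile still fits in on-chip memory."""
    tile_bytes = blocking * blocking * workload["bytes_per_element"]
    if tile_bytes > workload["onchip_bytes"]:
        return None                               # tile overflows on-chip memory
    return workload["total_bytes"] / blocking     # crude off-chip traffic proxy

def search_blocking(workload):
    best, best_cost = None, float("inf")
    for b in candidate_blockings(workload["dim"]):
        cost = estimate_offchip_traffic(b, workload)
        if cost is None:
            break                                 # Early Exit: larger tiles cannot fit either
        if cost < best_cost:
            best, best_cost = b, cost
    return best, best_cost

if __name__ == "__main__":
    wl = {"dim": 256, "bytes_per_element": 2,
          "onchip_bytes": 64 * 1024, "total_bytes": 16 * 1024 * 1024}
    print(search_blocking(wl))   # e.g. (128, 131072.0) under these assumptions
```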
The second part of this thesis focuses on accelerators that allow the simultaneous execution of multiple Deep Neural Network (DNN) workloads, improving performance by overlapping computation and memory accesses. For effective operation, on-chip memory must be large enough to accommodate the total memory footprint of all workloads. Batching enhances weight reuse and lowers off-chip access costs by letting DNN inferences of the same model share weights. However, traditional batching, which sets the batch size statically across all layers, can cause stalls when on-chip memory is insufficient. This part introduces BATCH-DNN, a dynamic method that adjusts the batch size of each layer based on the available on-chip memory. It employs two techniques: adaptive cascaded sub-batching and adaptive sub-batch merging. Offline profiling determines the memory footprint, while runtime adjustments set the maximum batch size per layer. BATCH-DNN can increase accelerator compute-fabric utilization by 40%, leading to throughput improvements of up to 27%, with an average improvement of 6% for batched multi-DNN workloads.
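The following sketch illustrates the general idea of dynamic per-layer batching described above: an offline-profiled per-sample footprint is used at run time to cap each layer's batch size to what fits in the on-chip memory left over by co-running workloads, and a requested batch is run as a cascade of sub-batches. The class, function, and field names are assumptions for illustration, not the actual BATCH-DNN interface.

```python
# Hypothetical sketch of per-layer batch-size capping and cascaded sub-batching.
from dataclasses import dataclass

@dataclass
class LayerProfile:
    name: str
    weight_bytes: int                   # shared by all samples in a batch
    activation_bytes_per_sample: int    # grows linearly with batch size

def max_batch_for_layer(layer: LayerProfile, free_onchip_bytes: int) -> int:
    """Largest batch whose weights plus per-sample activations fit on chip."""
    budget = free_onchip_bytes - layer.weight_bytes
    if budget <= 0:
        return 1                        # fall back to unbatched execution
    return max(1, budget // layer.activation_bytes_per_sample)

def split_into_sub_batches(requested_batch: int, layer: LayerProfile,
                           free_onchip_bytes: int):
    """Cascaded sub-batching: run the requested batch as a sequence of
    sub-batches, each sized to the memory available when the layer runs."""
    cap = max_batch_for_layer(layer, free_onchip_bytes)
    remaining = requested_batch
    while remaining > 0:
        sub = min(cap, remaining)
        yield sub
        remaining -= sub

if __name__ == "__main__":
    conv = LayerProfile("conv3", weight_bytes=256 * 1024,
                        activation_bytes_per_sample=96 * 1024)
    # Assume 512 KiB of on-chip memory currently free for this workload.
    print(list(split_into_sub_batches(8, conv, free_onchip_bytes=512 * 1024)))
    # -> [2, 2, 2, 2] under these assumed footprints
```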
Loop Reordering
Deep Learning Accelerator On-Chip Memory
Loop Blocking
Loop Optimizations
Dynamic and Adaptive Batching
Author
Piyumal Ranawaka
Chalmers, Computer Science and Engineering (Chalmers), Computer Engineering (Chalmers)
P. Ranawaka, P. Stenstrom, BATCH-DNN: Adaptive and Dynamic Batching for Multi-DNN Accelerators
DNNOPT: A Framework for Efficiently Selecting On-chip Memory Loop Optimizations of DNN Accelerators
Proceedings of the 21st ACM International Conference on Computing Frontiers, CF 2024 (2024), p. 126-137
Paper in proceeding
Areas of Advance
Information and Communication Technology
Subject Categories (SSIF 2025)
Computer Sciences
Computer Systems
Publisher
Chalmers
Chalmers EDIT Analysen
Opponent: Prof. Magnus Jahre, NTNU, Norway