BATCH-DNN: Adaptive and Dynamic Batching for Multi-DNN Accelerators
Paper in proceedings, 2026

Multi-DNN accelerators enable the simultaneous execution of multiple DNN workloads, improving performance by overlapping their computations and memory accesses. However, on-chip memory must accommodate the footprint of all workloads. Batching allows DNN inferences that use the same model to share weights, which improves weight reuse and reduces off-chip access costs over a batch. Existing batching schemes determine the batch size statically, leading to stalls when insufficient on-chip memory is available at runtime. This paper introduces BATCH-DNN, a dynamic method that adapts the batch size layer by layer to the available on-chip memory. It employs two techniques: adaptive cascaded sub-batching and adaptive sub-batch merging. Offline profiling establishes each layer's footprint, while a run-time adjustment establishes the maximum batch size for each layer based on the available on-chip memory. BATCH-DNN can improve the utilization of accelerator compute fabrics by 60%, which increases throughput by up to 27%, and by 6% on average, for batched multi-DNN workloads.
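The layer-by-layer adaptation described in the abstract can be pictured as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the footprint and free-memory inputs, and the greedy splitting policy are all assumptions made here for exposition.

    # Minimal sketch of layer-wise dynamic batching (illustrative assumptions,
    # not the paper's code): each layer's sub-batch size is recomputed from the
    # on-chip memory free at runtime and the layer's offline-profiled footprint.

    def max_sub_batch_size(free_sram: int, per_inference_footprint: int) -> int:
        """Largest sub-batch whose activations fit in the free on-chip memory."""
        return max(1, free_sram // per_inference_footprint)

    def run_network(batch, layer_footprints, free_sram_per_layer):
        """Run a batch through all layers, re-splitting it layer by layer.

        layer_footprints[i]    -- per-inference footprint of layer i (offline profiling)
        free_sram_per_layer[i] -- on-chip memory free when layer i runs (runtime)
        """
        for layer, (fp, free) in enumerate(zip(layer_footprints, free_sram_per_layer)):
            size = max_sub_batch_size(free, fp)
            # Cascaded sub-batching: consecutive sub-batches of the same layer
            # reuse the weights already staged on-chip.
            sub_batches = [batch[i:i + size] for i in range(0, len(batch), size)]
            for sb in sub_batches:
                pass  # placeholder: dispatch sb to the accelerator
            print(f"layer {layer}: {len(sub_batches)} sub-batch(es), size <= {size}")

    # Example: 8 inferences; the memory-hungry middle layer forces a split,
    # and the sub-batches merge back into one once memory frees up again.
    run_network(list(range(8)),
                layer_footprints=[1024, 4096, 512],
                free_sram_per_layer=[16384, 8192, 16384])

Under these assumed numbers, the 8-inference batch runs as a single sub-batch on the first layer, is cascaded as four sub-batches of two on the memory-constrained middle layer, and is merged back into one sub-batch once memory frees up again.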

On-chip memory management

Multi-DNN accelerator

Batching

Authors

Piyumal Ranawaka

Chalmers University of Technology, Computer Science and Engineering, Computer Engineering

University of Gothenburg

Per Stenström

Chalmers University of Technology, Computer Science and Engineering, Computer Engineering

University of Gothenburg

Lecture Notes in Computer Science

0302-9743 (ISSN), 1611-3349 (eISSN)

Vol. 15901 LNCS, pp. 103-117
978-3-031-99856-0 (ISBN)

31st International Conference on Parallel and Distributed Computing, Euro-Par 2025
Dresden, Germany

Subject Categories (SSIF 2025)

Computer Sciences

Computer Systems

DOI

10.1007/978-3-031-99857-7_8

Latest update

9/5/2025