Adaptive scheduling of inference pipelines on multicore architectures
Doctoral thesis, 2025

In today’s data-driven world, machine learning (ML) models, and Convolutional Neural Networks (CNNs) in particular, power applications across many domains. As demand for real-time inference grows, optimizing CNN inference across diverse computational platforms becomes imperative. This thesis addresses that challenge by examining the complexities posed by heterogeneous edge devices, chiplet-based architectures, and inference-serving systems.

Heterogeneous edge devices pose unique challenges due to resource constraints and architectural diversity, while chiplet-based architectures offer opportunities to improve inference performance. Building on online tuning algorithms, malleable and moldable inference pipelines, and adaptive scheduling strategies, this thesis proposes a comprehensive framework for optimizing CNN inference. The framework aims to improve system performance, reduce latency, and mitigate interference effects, contributing to more efficient and scalable AI systems that can meet the evolving demands of real-time inference across diverse computational platforms.
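To make the notion of online tuning concrete, the sketch below (illustrative only, not code from the thesis) shows a greedy hill-climbing tuner that reassigns cores among the stages of a CNN pipeline at runtime, accepting a move only when throughput improves. The stage costs, core count, and all names are assumptions made for this example; a real system would measure throughput online rather than model it.

import random

# Hypothetical model: a pipeline of CNN stages, each assigned a number of cores.
# Throughput is simulated here; a real tuner would measure it on the device.
STAGE_COSTS = [4.0, 9.0, 6.0, 2.0]   # relative work per stage (illustrative)
TOTAL_CORES = 8

def throughput(alloc):
    """Pipeline throughput is limited by the slowest stage (cost / cores)."""
    slowest = max(cost / cores for cost, cores in zip(STAGE_COSTS, alloc))
    return 1.0 / slowest

def online_tune(steps=200, seed=0):
    """Greedy online tuner: repeatedly try moving one core between stages
    and keep the change only if throughput improves."""
    rng = random.Random(seed)
    # Seed configuration: one core per stage, remainder to the heaviest stage.
    alloc = [1] * len(STAGE_COSTS)
    alloc[STAGE_COSTS.index(max(STAGE_COSTS))] += TOTAL_CORES - sum(alloc)
    best = throughput(alloc)
    for _ in range(steps):
        src, dst = rng.randrange(len(alloc)), rng.randrange(len(alloc))
        if src == dst or alloc[src] <= 1:
            continue  # keep every stage runnable
        candidate = alloc[:]
        candidate[src] -= 1
        candidate[dst] += 1
        t = throughput(candidate)
        if t > best:  # accept only improving moves
            alloc, best = candidate, t
    return alloc, best

if __name__ == "__main__":
    alloc, tput = online_tune()
    print(f"core allocation per stage: {alloc}, throughput: {tput:.3f}")

Seeding the search from an informed starting point (here, biasing cores toward the heaviest stage) mirrors the idea of using platform knowledge to expedite seed generation.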

The thesis addresses several key problems: enabling runtime scheduling of inference pipelines on edge devices, fully online scheduling of inference pipelines on heterogeneous platforms, mitigating interference effects on inference pipelines in inference-serving systems, and optimizing resource allocation for adaptive SLO-aware inference serving.

The contributions of this thesis are encapsulated in four papers, each focusing on a distinct aspect of CNN inference optimization: a comprehensive framework for online scheduling of CNN pipelines, the use of platform knowledge to expedite seed generation, dynamic scheduling techniques that alleviate interference effects, and SLO-aware scheduling techniques for optimizing resource allocation in inference-serving systems. Together, these contributions advance the state of the art in CNN inference optimization and inference serving, enabling more efficient and scalable AI systems for real-time inference across diverse computational platforms.
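As a rough illustration of the SLO-aware, malleable scheduling idea behind the last contribution, the hypothetical control loop below grows a pipeline's core allocation when observed latency violates its SLO and releases cores when latency sits comfortably below the target. The thresholds, bounds, and names are invented for this sketch and do not reflect the thesis's actual policy.

def adjust_allocation(cores, observed_latency_ms, slo_ms,
                      min_cores=1, max_cores=16, slack=0.8):
    """Malleable-style resizing (illustrative): scale up on SLO violation,
    scale down when latency is well below the SLO."""
    if observed_latency_ms > slo_ms:
        return min(cores + 1, max_cores)  # grow to recover the SLO
    if observed_latency_ms < slack * slo_ms:
        return max(cores - 1, min_cores)  # shrink to free resources
    return cores                          # within band: hold steady

# Example: reacting to a latency trace against a 50 ms SLO
cores = 4
for latency in [62.0, 55.0, 48.0, 33.0, 30.0, 41.0]:
    cores = adjust_allocation(cores, latency, slo_ms=50.0)
    print(f"latency={latency:5.1f} ms -> cores={cores}")

Keeping a slack band between the grow and shrink thresholds avoids oscillating allocations when latency hovers near the SLO.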

Online tuning

CNN parallel pipelines

Design space exploration

Interference mitigation

Heterogeneous computing units

Processing on chiplets

Inference-serving systems

Author

Pirah Noor Soomro

Chalmers, Computer Science and Engineering, Computer Engineering

Accordion: A malleable pipeline scheduling approach for adaptive SLO-aware inference serving

Proceedings of the 22nd ACM International Conference on Computing Frontiers (2025), pp. 159-167

Paper in proceeding

ODIN: Overcoming Dynamic Interference in iNference Pipelines

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 14100 LNCS (2023), pp. 169-183

Paper in proceeding

Shisha: Online Scheduling of CNN Pipelines on Heterogeneous Architectures

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 13826 LNCS (2023), pp. 249-262

Paper in proceeding

An online guided tuning approach to run CNN pipelines on edge devices

Proceedings of the 18th ACM International Conference on Computing Frontiers (CF 2021), pp. 45-53

Paper in proceeding

Subject Categories (SSIF 2025)

Computer Sciences

Computer Systems

ISBN

978-91-8103-261-1

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5719

Publisher

Chalmers

More information

Latest update

8/8/2025