Accordion: A malleable pipeline scheduling approach for adaptive SLO-aware inference serving
Paper in proceedings, 2025
To address these challenges, we propose an adaptive solution that leverages SLO-aware scheduling techniques to optimize resource allocation, aiming to minimize the additional resources needed per inference service. By introducing malleable inference pipelines, we add flexibility to resource allocation during peak loads: the resource assignment of each processing pipeline is readjusted dynamically to accommodate as many queries as possible.
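To illustrate the idea of malleable pipelines at a high level, the following is a minimal sketch (not the paper's implementation) of a scheduler that shrinks the cores assigned to each inference pipeline under peak load, as long as the estimated latency still meets the SLO, and grows them back when the queue drains. All names and the scaling model (Pipeline, estimate_latency, TOTAL_CORES, SLO_MS) are hypothetical assumptions for illustration only.

```python
# Illustrative sketch of malleable, SLO-aware resource reassignment.
# All identifiers and the latency model are assumptions, not the paper's code.
from dataclasses import dataclass

TOTAL_CORES = 16            # assumed machine size
SLO_MS = 100.0              # assumed per-query latency target
MIN_CORES, MAX_CORES = 1, 8

@dataclass
class Pipeline:
    name: str
    cores: int                # current (malleable) resource assignment
    base_latency_ms: float    # measured single-core latency

    def estimate_latency(self, cores: int) -> float:
        # Crude strong-scaling model, used only for illustration.
        return self.base_latency_ms / cores

def rebalance(pipelines: list[Pipeline], queued_queries: int) -> None:
    """Shrink pipelines under high load so more queries fit in the system,
    subject to the SLO; grow them back when load is light."""
    for p in pipelines:
        if queued_queries > len(pipelines):      # peak load: try to shrink
            while p.cores > MIN_CORES and p.estimate_latency(p.cores - 1) <= SLO_MS:
                p.cores -= 1
        else:                                    # light load: expand again
            free = TOTAL_CORES - sum(q.cores for q in pipelines)
            p.cores = min(MAX_CORES, p.cores + max(free, 0) // max(len(pipelines), 1))

if __name__ == "__main__":
    pipes = [Pipeline("resnet", 4, 240.0), Pipeline("bert", 4, 320.0)]
    rebalance(pipes, queued_queries=10)
    print([(p.name, p.cores) for p in pipes])   # e.g. [('resnet', 3), ('bert', 4)]
```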
Our findings indicate that the proposed scheduler effectively utilizes system resources throughout execution while meeting most SLOs, incurring 4.2× fewer SLO violations. We observe an average 1.6× reduction in end-to-end query processing latency compared to baseline methods. We also demonstrate the impact of dynamically reducing the resources per inference query to accommodate more inference queries in the system: our solution accommodates 1.4× more queries than the baselines and achieves 1.6× higher system throughput in queries per second, both on average.
Malleable Task Scheduling
SLO-Aware Scheduling Techniques
Parallel Pipelines
Inference Serving System
Authors
Pirah Noor Soomro
Chalmers, Computer Science and Engineering (Chalmers), Computer Engineering (Chalmers)
Nikela Papadopoulou
Chalmers, Computer Science and Engineering (Chalmers), Computer Engineering (Chalmers)
Miquel Pericas
Chalmers, Computer Science and Engineering (Chalmers), Computer Engineering (Chalmers)
Proceedings of the 22nd ACM International Conference on Computing Frontiers
2687‑9247 (ISSN)
Vol. 1 159-167
979-8-4007-1528-0 (ISBN)
Italy
Subject Categories (SSIF 2025)
Computer Sciences
Computer Systems
DOI
10.1145/3719276.3725190
ISBN
9798400715280