Accordion: A malleable pipeline scheduling approach for adaptive SLO-aware inference serving
Paper in proceeding, 2025
To address these, we propose an adaptive solution that uses SLO-aware scheduling techniques to optimize resource allocation. Our approach aims to minimize the additional resources needed per inference service. By introducing malleable inference pipelines, we make resource allocation more flexible during peak loads: the resources assigned to processing pipelines are readjusted dynamically to accommodate as many queries as possible.
Our findings indicate that the proposed scheduler effectively utilizes system resources throughout execution while meeting most SLOs (4.2× fewer SLO violations). We observe an average reduction of 1.6× in the end-to-end latency of query processing, compared to baseline methods. We also demonstrate the impact of dynamically reducing the resources per inference query to accommodate more inference queries in the system. Our solution accommodates 1.4× more queries on average compared to the baselines and achieves 1.6× higher system throughput in terms of queries per second on average.
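The idea of shrinking running pipelines to admit more queries can be illustrated with a minimal sketch. This is not the Accordion scheduler described in the paper; the `Pipeline` class, the `admit` function, the core counts, and the "shrink the pipeline with the most slack first" policy are all hypothetical assumptions chosen only to show the malleability concept.

```python
from dataclasses import dataclass


@dataclass
class Pipeline:
    name: str
    cores: int      # cores currently assigned (or requested, for a new pipeline)
    min_cores: int  # smallest width assumed to still meet the SLO (hypothetical)


def admit(running, total_cores, new):
    """Try to admit `new`, shrinking running pipelines if necessary.

    On success, appends `new` to `running` and returns True.
    On failure, leaves `running` unchanged and returns False.
    """
    free = total_cores - sum(p.cores for p in running)
    # Cores that could be reclaimed if every pipeline shrank to its minimum.
    reclaimable = sum(p.cores - p.min_cores for p in running)
    if free + reclaimable < new.min_cores:
        return False

    if free >= new.min_cores:
        # Enough idle cores: grant up to the requested width, no shrinking.
        grant = min(new.cores, free)
    else:
        # Not enough idle cores: grant only the minimum and reclaim the rest.
        grant = new.min_cores
    deficit = grant - free

    # Assumed policy: take cores from the pipelines with the most slack first.
    for p in sorted(running, key=lambda p: p.cores - p.min_cores, reverse=True):
        if deficit <= 0:
            break
        take = min(deficit, p.cores - p.min_cores)
        p.cores -= take
        deficit -= take

    new.cores = grant
    running.append(new)
    return True


# Usage: an 8-core machine already fully occupied by two 4-core pipelines.
running = [Pipeline("A", 4, 2), Pipeline("B", 4, 2)]
admitted_c = admit(running, 8, Pipeline("C", 4, 2))
admitted_d = admit(running, 8, Pipeline("D", 4, 2))
admitted_e = admit(running, 8, Pipeline("E", 2, 2))
```

In this toy setting a rigid allocation would hold only two queries, while the malleable one holds four before rejecting a fifth. A real scheduler would also need a latency model to decide how far each pipeline can shrink without violating its SLO.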
Malleable Task Scheduling
SLO-Aware scheduling techniques
Parallel Pipelines
Inference Serving System
Authors
Pirah Noor Soomro
Chalmers, Computer Science and Engineering, Computer Engineering
Nikela Papadopoulou
Chalmers, Computer Science and Engineering, Computer Engineering
Miquel Pericas
Chalmers, Computer Science and Engineering, Computer Engineering
Proceedings of the 22nd ACM International Conference on Computing Frontiers
2687‑9247 (ISSN)
Vol. 1 159-167
979-8-4007-1528-0 (ISBN)
, Italy,
Subject categories (SSIF 2025)
Computer Sciences
Computer Systems
DOI
10.1145/3719276.3725190
ISBN
9798400715280