Data Sketches and Parallelism for Efficient Data Pipelines
Licentiate thesis, 2026
Parallelising the construction of sketches becomes essential at high data rates and volumes. At the same time, long-running analytics processes in continuous, high-rate pipelines require low-latency querying of sketches concurrently with updates. Further, the ability to estimate several query types from a single sketch avoids constructing multiple different sketches, which benefits space, timeliness, and consistency. Integrating all these dimensions for the first time, this thesis proposes LMQ-Sketch and explores the challenging trade-offs in designing parallel data sketches for the classical problems of frequency-moment and item-frequency estimation. The work sheds light on the synchronisation necessary among concurrent operations, carefully balancing consistency, freshness, and accuracy with a low memory footprint and high throughput.
Data-intensive systems distribute work across threads or nodes by locally building sketches, which are subsequently shared and merged, but such systems may face workload and resource fluctuations. To this end, the thesis introduces ReSketch, a resizable, partitionable sketch with enhanced mergeability, which permits merging sketches of different sizes to obtain a good balance of memory footprint and accuracy while enabling novel elastic capabilities.
Beyond studying data summaries as pipeline components, the thesis presents FORTE, a framework for lossless data transfer pipelines in which compression, necessary for timely, reliable, and efficient transmission, must be balanced against resource constraints while ensuring data integrity, confidentiality, and governance. Latency, throughput, and sustainable rates are measured in a real-world industrial use case managing terabytes of data per day, demonstrating the benefits of effective pipelining, scheduling, and efficient use of resources. Altogether, this thesis examines how data reduction and pipeline optimisation can enable parallel, efficient, and performant data pipelines, opening new avenues for bridging the gap between algorithms and practical deployment in adaptable, scalable, resource-aware data processing systems.
Data Sketches
Scheduling
Approximate Query Processing
Compression
Data Summarisation
Data Pipelines
Parallelism and Concurrency
Hashing
Author
Martin Hilgendorf
Chalmers, Computer Science and Engineering (Chalmers), Computer and Network Systems
LMQ-Sketch: Lagom Multi-Query Sketch for High-Rate Online Analytics
Leibniz International Proceedings in Informatics, LIPIcs; Vol. 356 (2025), p. 36:1-36:24
Paper in proceeding
ReSketch: A Mergeable, Partitionable, and Resizable Sketch
FORTE: an extensible framework for robustness and efficiency in data transfer pipelines
DEBS 2023 - Proceedings of the 17th ACM International Conference on Distributed and Event-based Systems (2023), p. 139-150
Paper in proceeding
VR EPITOME - Summarization and structuring of continuous data in concurrent processing pipelines
Swedish Research Council (VR) (2021-05424), 2022-01-01 -- 2025-12-31.
Subject Categories (SSIF 2025)
Computer Sciences
Networked, Parallel and Distributed Computing
Publisher
Chalmers
EB, EDIT Building, Hörsalsvägen 11
Opponent: Paris Carbone, KTH Royal Institute of Technology