Data Sketches and Parallelism for Efficient Data Pipelines

Martin Hilgendorf

Licentiate thesis, 2026

Data Summarisation transforms massive datasets into convenient and compact summaries, or synopses, from which query results can be approximated. Such summaries are orders of magnitude smaller than the original data and come with mathematical guarantees on the approximation accuracy, making them an attractive solution to the challenges of Big Data. In recent years, Data Sketches have seen widespread adoption for efficiently summarising Big Data in a single pass, using little memory and time. This thesis studies several challenges that arise when utilising sketches or compression in Big Data processes and pipelines.

Parallelising the construction of sketches becomes essential at high data rates and volumes. At the same time, long-running analytics processes in continuous, high-rate pipelines require low-latency querying of sketches concurrently with updates. Further, the ability to estimate several query types from a single sketch avoids constructing multiple different sketches, which benefits space, timeliness, and consistency. Integrating all these dimensions for the first time, this thesis proposes LMQ-Sketch and explores the challenging trade-offs in designing parallel data sketches for the classical problems of frequency-moment and item-frequency estimation. The work sheds light on the necessary synchronisation among concurrent operations, carefully balancing consistency, freshness, and accuracy with a low memory footprint and high throughput.
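For readers unfamiliar with item-frequency sketches, the following is a minimal, sequential illustration using a classical Count-Min sketch. It is not LMQ-Sketch and has no concurrency support; the class name, the default `width` and `depth`, and the use of salted BLAKE2b as the hash family are assumptions made purely for this example.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch (not LMQ-Sketch, single-threaded)."""

    def __init__(self, width=256, depth=4):
        # width and depth are assumed example values; they control the
        # trade-off between memory footprint and estimation error.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row stands in for a family of hash functions.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        # Single-pass update: touch exactly one counter per row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only add to counters, so the minimum over rows
        # never underestimates the true frequency.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )

    def total(self):
        # Every update adds `count` to exactly one cell of each row, so any
        # row sums to the total count (the first frequency moment).
        return sum(self.table[0])


sketch = CountMinSketch()
for token in ["a"] * 50 + ["b"] * 20 + ["c"] * 5:
    sketch.update(token)
```

Here `sketch.estimate("a")` is at least 50 and at most `sketch.total()`, regardless of how the hash functions distribute the items.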

Data-intensive systems distribute work across threads or nodes by locally building sketches, which are subsequently shared and merged, but they may face workload and resource fluctuations. To this end, the thesis introduces ReSketch, a resizable, partitionable sketch with enhanced mergeability, which permits merging sketches of different sizes to obtain a good balance of memory footprint and accuracy while enabling novel elastic capabilities.
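As a point of reference for mergeability, the sketch below shows the classical baseline only: two Count-Min tables of identical dimensions, built with the same hash functions, are merged by cell-wise addition. It does not implement ReSketch, and in particular it cannot merge sketches of different sizes; `WIDTH`, `DEPTH`, and all function names are hypothetical and chosen for illustration.

```python
import hashlib

# Assumed example dimensions; both sketches must share them to be mergeable
# in this classical scheme.
WIDTH, DEPTH = 128, 4


def new_sketch():
    return [[0] * WIDTH for _ in range(DEPTH)]


def cell(item, row):
    # Salted BLAKE2b stands in for a family of per-row hash functions.
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % WIDTH


def update(sketch, item, count=1):
    for row in range(DEPTH):
        sketch[row][cell(item, row)] += count


def estimate(sketch, item):
    return min(sketch[row][cell(item, row)] for row in range(DEPTH))


def merge(a, b):
    # Counters are linear in the input, so adding cell by cell yields the
    # sketch of the concatenated streams.
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Two workers summarise their own partitions, then the results are merged.
part_a = ["x", "y", "x"]
part_b = ["x", "z"]
local_a, local_b, whole = new_sketch(), new_sketch(), new_sketch()
for item in part_a:
    update(local_a, item)
for item in part_b:
    update(local_b, item)
for item in part_a + part_b:
    update(whole, item)
merged = merge(local_a, local_b)
```

Because the counters are additive, `merged` is identical to `whole`, the sketch built over the undivided stream.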

Beyond studying data summaries as pipeline components, the thesis presents FORTE, a framework for lossless data transfer pipelines in which compression, necessary for timely, reliable, and efficient transmission, must be balanced against resource constraints while ensuring data integrity, confidentiality, and governance. Latency, throughput, and sustainable rates are measured in a real-world industrial use case managing terabytes of data per day, demonstrating the benefits of effective pipelining, scheduling, and efficient use of resources. Altogether, this thesis examines how data reduction and pipeline optimisation can enable parallel, efficient, and performant data pipelines, opening new avenues for bridging the gap between algorithms and practical deployment in adaptable, scalable, resource-aware data processing systems.

Compression

Approximate Query Processing

Data Sketches

Hashing

Scheduling

Parallelism and Concurrency

Data Pipelines

Data Summarisation

EB, EDIT Building, Hörsalsvägen 11

Opponent: Paris Carbone, KTH Royal Institute of Technology

Online defence

Author

Martin Hilgendorf

Chalmers, Computer Science and Engineering (Chalmers), Computer and Network Systems


LMQ-Sketch: Lagom Multi-Query Sketch for High-Rate Online Analytics

Leibniz International Proceedings in Informatics, LIPIcs, Vol. 356 (2025), p. 36:1-36:24

Paper in proceeding

ReSketch: A Mergeable, Partitionable, and Resizable Sketch

FORTE: an extensible framework for robustness and efficiency in data transfer pipelines

DEBS 2023 - Proceedings of the 17th ACM International Conference on Distributed and Event-based Systems (2023), p. 139-150

Paper in proceeding

VR EPITOME - Summarization and structuring of continuous data in concurrent processing pipelines

Swedish Research Council (VR) (2021-05424), 2022-01-01 -- 2025-12-31.


Subject Categories (SSIF 2025)

Computer Sciences

Networked, Parallel and Distributed Computing

Publisher

Chalmers