Efficient, Adaptable, and Scalable Synopses for Data-Intensive Systems
Licentiate thesis, 2026
Targeting these challenges, this thesis studies two core summarization primitives, heavy-hitter detection and frequency estimation, and contributes as follows. Chapter A analyzes the trade-offs among throughput, memory usage, and accuracy in heavy-hitter detection algorithms; the insights led to the design of the Cuckoo Heavy Keeper (CHK) algorithm, which introduces a process for distinguishing frequent from infrequent items that unlocks synergies inaccessible to conventional approaches, such as reduced per-item instruction cost and improved cache behavior. Chapter A also introduces a categorization of parallelization approaches and the multi-CHK (mCHK) framework, which can parallelize any sequential heavy-hitter algorithm, with support for concurrent updates and queries. Chapter B identifies three properties that target the above challenges: resizability (adjusting memory at runtime), enhanced mergeability (combining differently-sized summaries), and partitionability (splitting state for elastic scaling and load rebalancing). Building on these properties, Chapter B proposes ReSketch, a frequency estimation sketch design that achieves all three while maintaining a beneficial memory-to-accuracy ratio, together with the instance provenance DAG, which tracks how approximation bounds evolve through arbitrary sequences of these operations. Together, these results provide complementary building blocks for efficient, adaptable, and scalable summarization in modern data-intensive systems.
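To make the two primitives concrete, the following is a minimal sketch of frequency estimation with a textbook Count-Min sketch, including the classic merge of two equally sized summaries. It is not CHK or ReSketch, whose designs are not described in this record. The class name `CountMinSketch`, its parameters, and the choice of hash function are assumptions made purely for illustration.

```python
import hashlib


class CountMinSketch:
    """Textbook Count-Min sketch; illustrative only, not CHK or ReSketch."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One hash function per row, obtained by salting with the row number.
        digest = hashlib.blake2b(
            item.encode(), salt=row.to_bytes(8, "little"), digest_size=8
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item: str, count: int = 1) -> None:
        for r in range(self.depth):
            self.rows[r][self._index(item, r)] += count

    def estimate(self, item: str) -> int:
        # Collisions only add to counters, so the row minimum never underestimates.
        return min(self.rows[r][self._index(item, r)] for r in range(self.depth))

    def merge(self, other: "CountMinSketch") -> "CountMinSketch":
        # Classic mergeability: counters add element-wise, which requires
        # both sketches to have identical dimensions and hash functions.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("classic merge needs identical dimensions")
        merged = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            merged.rows[r] = [a + b for a, b in zip(self.rows[r], other.rows[r])]
        return merged


if __name__ == "__main__":
    left, right = CountMinSketch(64, 4), CountMinSketch(64, 4)
    for item in ["x"] * 5 + ["y"] * 2:
        left.update(item)
    for item in ["x"] * 3 + ["z"]:
        right.update(item)
    combined = left.merge(right)
    # The estimate for "x" is at least its true count of 8.
    print(combined.estimate("x"))
```

The dimension check in `merge` marks the restriction that the abstract's "enhanced mergeability" (combining differently-sized summaries) is meant to lift; how ReSketch does so is not shown here.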
Scalability
Efficiency
Synopsis
Data-Intensive Systems
Adaptability
Concurrency & Parallelism
Data Summarization
Author
Quang Vinh Ngo
Chalmers, Computer Science and Engineering (Chalmers), Computer and Network Systems
Cuckoo Heavy Keeper and the balancing act of maintaining heavy hitters in stream processing
Proceedings of the VLDB Endowment, Vol. 18 (2025), p. 3149-3161
Paper in proceeding
ReSketch: A Mergeable, Partitionable, and Resizable Sketch
Relaxed Semantics Across the Data Analytics Stack (RELAX-DN)
European Commission (EC) (EC/HE/101072456), 2023-03-01 -- 2027-03-01.
Subject Categories (SSIF 2025)
Computer Sciences
Computer Engineering
Computer Systems
Technical report L - Department of Computer Science and Engineering, Chalmers University of Technology and Göteborg University
Publisher
Chalmers
Room ED, The EDIT building, Chalmers University of Technology (Campus Johanneberg)
Opponent: Prof. Papapetrou Odysseas, Eindhoven University of Technology, The Netherlands