Efficient, Adaptable, and Scalable Synopses for Data-Intensive Systems
Licentiate thesis, 2026

Data-intensive systems generate data at rates and volumes that demand timely single-pass analysis with bounded memory. However, computing exact statistics in a single pass often requires memory that grows with the data under consideration; this is manageable for small windows, but becomes infeasible as the scope or number of computations grows. Data summarization via synopses addresses this by capturing statistics in sublinear space with bounded error, trading accuracy for memory and throughput. Although synopses are well established algorithmically, their practical use depends on hardware and deployment details. Most synopsis algorithms are sequential, yet production data rates often exceed a single core's capacity; throughput and latency also depend on cache behavior and access patterns, not just asymptotic complexity. Furthermore, synopses are typically initialized with fixed settings and approximation bounds, and cannot adapt to changing memory budgets and workloads. Multi-core and distributed systems can absorb higher rates, but they bring contention, cache-coherence costs, and state migration when scaling, and require merging differently-sized summaries.
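To make the memory argument concrete, a minimal sketch (assuming Python and a synthetic stream, not taken from the thesis) shows that exact single-pass frequency counting keeps one counter per distinct item, so its state grows with the stream's domain rather than with any fixed budget:

```python
from collections import Counter

# Exact single-pass frequency counting: one counter per distinct item.
# A stream with 10,000 distinct keys forces 10,000 counters, regardless
# of any memory budget we might want to impose.
stream = [f"item-{i % 10_000}" for i in range(100_000)]
exact = Counter(stream)
print(len(exact))  # 10000 counters for 10000 distinct items
```

A synopsis instead fixes the counter array size up front and accepts bounded error, which is the accuracy-for-memory trade the abstract describes.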

Targeting these challenges, this thesis studies two core summarization primitives, heavy-hitter detection and frequency estimation, and contributes as follows. Chapter A analyzes the trade-offs among throughput, memory usage, and accuracy in heavy-hitter detection algorithms; the insights led to the design of the Cuckoo Heavy Keeper (CHK) algorithm, which introduces a process for distinguishing frequent from infrequent items that unlocks synergies inaccessible to conventional approaches, such as reduced per-item instruction cost and improved cache behavior. Chapter A also introduces a categorization of parallelization approaches and the multi-CHK (mCHK) framework, which can parallelize any sequential heavy-hitter algorithm, with support for concurrent updates and queries. Chapter B identifies three properties that target the above challenges: resizability (adjusting memory at runtime), enhanced mergeability (combining differently-sized summaries), and partitionability (splitting state for elastic scaling and load rebalancing). Building on these properties, Chapter B proposes ReSketch, a frequency estimation sketch design that achieves all three while maintaining a beneficial memory-to-accuracy ratio, together with the instance provenance DAG, which tracks how approximation bounds evolve through arbitrary sequences of these operations. Together, these results provide complementary building blocks for efficient, adaptable, and scalable summarization in modern data-intensive systems.
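As an illustration of the mergeability property Chapter B extends, a toy Count-Min sketch (a standard frequency-estimation synopsis; this is an illustrative sketch under assumed parameters, not the thesis's ReSketch design) supports point queries in fixed space and classic merging, which is only defined for identically-sized instances:

```python
import hashlib

class CountMin:
    """Toy Count-Min sketch: d rows of w counters. Point-query estimates
    never underestimate; collisions can only inflate them."""
    def __init__(self, width=1024, depth=4):
        self.w, self.d = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _h(self, item, row):
        # One independent-ish hash per row, derived from a keyed digest.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.w

    def add(self, item, count=1):
        for r in range(self.d):
            self.rows[r][self._h(item, r)] += count

    def estimate(self, item):
        # Minimum across rows: the least-inflated counter.
        return min(self.rows[r][self._h(item, r)] for r in range(self.d))

    def merge(self, other):
        # Classic mergeability: only defined for identically-sized sketches.
        # This is exactly the restriction that enhanced mergeability
        # (combining differently-sized summaries) lifts.
        assert (self.w, self.d) == (other.w, other.d)
        for r in range(self.d):
            for c in range(self.w):
                self.rows[r][c] += other.rows[r][c]

a, b = CountMin(), CountMin()
for _ in range(500):
    a.add("heavy")
b.add("heavy", 250)
a.merge(b)
print(a.estimate("heavy"))  # 750: exact here, since no other items collide
```

The `assert` in `merge` marks the limitation the thesis targets: two sketches sized for different memory budgets cannot be combined cell-by-cell, motivating enhanced mergeability and the provenance tracking of approximation bounds across such operations.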

Scalability

Efficiency

Synopsis

Data-Intensive Systems

Adaptability

Concurrency & Parallelism

Data Summarization

Room ED, The EDIT building, Chalmers University of Technology (Campus Johanneberg)
Opponent: Prof. Odysseas Papapetrou, Eindhoven University of Technology, The Netherlands

Author

Quang Vinh Ngo

Chalmers, Computer Science and Engineering (Chalmers), Computer and Network Systems

Cuckoo Heavy Keeper and the balancing act of maintaining heavy hitters in stream processing

Proceedings of the VLDB Endowment, Vol. 18 (2025), pp. 3149–3161

Paper in proceeding

ReSketch: A Mergeable, Partitionable, and Resizable Sketch

Relaxed Semantics Across the Data Analytics Stack (RELAX-DN)

European Commission (EC) (EC/HE/101072456), 2023-03-01 -- 2027-03-01.

Subject Categories (SSIF 2025)

Computer Sciences

Computer Engineering

Computer Systems

Technical report L - Department of Computer Science and Engineering, Chalmers University of Technology and Göteborg University

Publisher

Chalmers


Online


More information

Created

4/13/2026