Concurrent Data Structures for Efficient Streaming Aggregation
Report, 2013
In many data gathering applications, information arrives in the form of continuous streams rather than finite data sets.
Efficient one-pass algorithms are required to cope with high input loads.
Stream processing engines support continuous queries to process data in real time and have evolved rapidly from centralized to distributed, parallel and elastic solutions.
While much effort has gone into leveraging the processing capacity of clusters of machines, less work has focused on exploiting the parallelism of multi-core architectures through concurrent and lock-free data structures that support the processing pipeline.
This paper explores this aspect, focusing on multiway aggregation, where large data volumes are received from multiple input streams.
Multiway aggregation is crucial in contexts such as sensor networks, social media or clickstream analysis applications.
We provide three enhanced aggregate operators that rely on two new concurrent data structures and their lock-free implementations, supporting both order-sensitive and order-insensitive aggregation functions.
We provide an extensive study of the properties of the proposed aggregate operators and the new data structures.
We also present an extensive experimental evaluation of the proposed methods, providing empirical evidence of their performance advantage.
In this evaluation, we run a variety of aggregation queries on two large datasets: one with data extracted from SoundCloud, a music social network, and one with data from a smart grid metering network.
In all experiments, the new data structures improve aggregation performance significantly, by up to one order of magnitude, in both processing throughput and latency.
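To make the distinction between order-insensitive and order-sensitive aggregation functions concrete, the following Python sketch merges two timestamp-sorted input streams and aggregates them per key over tumbling windows in a single pass. It is a sequential illustration only, not the concurrent, lock-free data structures proposed in this paper. The names merge_streams and tumbling_aggregate, the (timestamp, key, value) tuple layout, the sample data, and the window size are all assumptions made for the example.

```python
import heapq


def merge_streams(*streams):
    """Merge timestamp-sorted streams of (timestamp, key, value) tuples
    into a single timestamp-sorted stream (hypothetical helper)."""
    return heapq.merge(*streams, key=lambda t: t[0])


def tumbling_aggregate(stream, window_size, init, update):
    """One-pass aggregation per key over tumbling windows.

    Assumes `stream` is sorted by timestamp. Yields
    (window_start, key, aggregate) each time a window closes.
    """
    current = None   # index of the window currently being filled
    state = {}       # key -> running aggregate for the current window
    for ts, key, value in stream:
        window = ts // window_size
        if current is not None and window != current:
            # The input is sorted, so the previous window is complete.
            for k, acc in state.items():
                yield (current * window_size, k, acc)
            state = {}
        current = window
        state[key] = update(state.get(key, init), value)
    if current is not None:
        # Flush the last open window at end of input.
        for k, acc in state.items():
            yield (current * window_size, k, acc)


# Two illustrative input streams, each sorted by timestamp.
s1 = [(1, "a", 10), (4, "a", 30), (11, "a", 5)]
s2 = [(2, "a", 20), (3, "b", 7), (12, "b", 1)]

# Order-insensitive aggregate: sum of values per key and window.
sums = list(tumbling_aggregate(merge_streams(s1, s2), 10, 0,
                               lambda acc, v: acc + v))

# Order-sensitive aggregate: last value seen per key and window.
lasts = list(tumbling_aggregate(merge_streams(s1, s2), 10, None,
                                lambda acc, v: v))
```

A sum gives the same result under any interleaving of the inputs, whereas the "last value" aggregate depends on tuples being processed in timestamp order, which is why the merge step matters for order-sensitive functions.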