ScaleJoin: a deterministic, disjoint-parallel and skew-resilient stream join enabled by concurrent data structures
Motivated by the inherently high computational complexity of stream joins, considerable research effort has been devoted to their parallelization. Significant increases in processing throughput have been achieved by methods that exploit the parallelization features of the hardware. At the same time, challenging aspects such as deterministic processing are only partially addressed.
In this work, we tackle the parallelization challenges of stream joins from a different design perspective: we study the points where data is exchanged and shared. This analysis reveals the need for a balance between the amount of independent action that processing entities (be they processor units or processing threads) can take and the synchronization needed to guarantee deterministic processing. We propose concurrent shared data objects that satisfy these needs, and we provide algorithmic implementations of these objects that lead to a deterministic, highly parallel stream join. In particular, we present ScaleJoin, a parallel stream join built upon ScaleGate, a new concurrent abstract data type that acts at the articulation points and maintains the tuples being consumed and produced in a deterministic fashion, regardless of the number of processing threads or the number of physical streams delivering them. ScaleJoin allows for the parallel execution of an arbitrary number of sequential stream joins and distributes the overall work among them without assuming any centralized coordinator. As we show in our evaluation, ScaleJoin not only enforces deterministic processing while providing disjoint and skew-resilient parallelism, but also achieves higher processing throughput and lower processing latency than state-of-the-art parallel stream joins such as CellJoin and Handshake.
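To make the notion of deterministic processing at an articulation point more concrete, the following is a minimal single-threaded sketch in Python. It is not the ScaleGate implementation, which is a concurrent data structure. The names `ToyGate`, `add_tuple`, `get_ready` and `run` are hypothetical, and the release rule used here (a tuple is handed out only once every source has delivered a strictly larger timestamp) is an assumption made for illustration, as is the requirement that each source delivers tuples in non-decreasing timestamp order.

```python
import heapq
from typing import Any, List, Tuple


class ToyGate:
    """Single-threaded toy model of a merge point for timestamp-sorted sources.

    Hypothetical illustration only: it models the ordering guarantee,
    not the concurrency of a real shared data object.
    """

    def __init__(self, num_sources: int) -> None:
        # Latest timestamp seen from each source (assumed non-decreasing).
        self._latest: List[float] = [float("-inf")] * num_sources
        # Min-heap of (timestamp, source, arrival sequence, payload).
        self._pending: List[Tuple[int, int, int, Any]] = []
        self._seq = 0

    def add_tuple(self, source: int, ts: int, payload: Any) -> None:
        if ts < self._latest[source]:
            raise ValueError("each source must deliver non-decreasing timestamps")
        self._latest[source] = ts
        heapq.heappush(self._pending, (ts, source, self._seq, payload))
        self._seq += 1

    def get_ready(self) -> List[Tuple[int, int, Any]]:
        # No source can still deliver a timestamp below this bound,
        # so everything strictly below it can be released in a fixed order.
        bound = min(self._latest)
        ready = []
        while self._pending and self._pending[0][0] < bound:
            ts, source, _, payload = heapq.heappop(self._pending)
            ready.append((ts, source, payload))
        return ready


def run(arrivals, num_sources=2):
    """Feed (source, timestamp, payload) arrivals and collect released tuples."""
    gate = ToyGate(num_sources)
    out = []
    for source, ts, payload in arrivals:
        gate.add_tuple(source, ts, payload)
        out.extend(gate.get_ready())
    return out


# Two different interleavings of the same two timestamp-sorted streams.
order_a = run([(0, 1, "r1"), (0, 4, "r4"), (1, 2, "s2"), (1, 5, "s5"), (0, 6, "r6")])
order_b = run([(1, 2, "s2"), (0, 1, "r1"), (1, 5, "s5"), (0, 4, "r4"), (0, 6, "r6")])
print(order_a == order_b)
```

Under these assumptions, both interleavings release the same tuples in the same order, so a consumer sees an identical sequence regardless of how the physical streams happened to interleave. Tuples at or above the bound stay buffered until every source has advanced past them.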