Reconfigurable-Hardware Accelerated Stream Aggregation
Doctoral thesis, 2022

High-throughput, low-latency stream aggregation is essential for many applications that analyze massive volumes of data in real time. When incremental aggregation is wasteful or not possible at all, incoming data must be stored in a single sliding window before processing. However, storing all incoming values in a single window puts tremendous pressure on memory bandwidth and capacity. GPU and CPU memory management is inefficient for this task, as it introduces unnecessary data movement that wastes bandwidth. FPGAs can make more efficient use of their memory, but existing approaches employ only on-chip memory and can therefore support only small problem sizes (i.e., small sliding windows and few keys) due to the limited capacity. This thesis addresses these limitations of stream processing systems by proposing techniques for accelerating single sliding-window stream aggregation using FPGAs to achieve line-rate processing throughput and ultra-low latency.
It does so first by building accelerators using FPGAs and second by alleviating the memory pressure posed by single-window stream aggregation. The initial part of this thesis presents accelerators for both windowing policies, namely tuple-based and time-based, using Maxeler's DataFlow Engines (DFEs), which have a direct feed of incoming data from the network as well as direct access to off-chip DRAM. Compared to a state-of-the-art stream processing software system, the DFEs offer 1-2 orders of magnitude higher processing throughput and 4 orders of magnitude lower latency.
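The two windowing policies named above can be illustrated in software. The following is a minimal sketch only, not the DFE design: the class names `TupleWindow` and `TimeWindow`, their methods, and the choice of the median as an example of an aggregate that needs the whole window are all assumptions made for illustration.

```python
from collections import deque
from statistics import median


class TupleWindow:
    """Tuple-based policy: the window holds the last `size` values."""

    def __init__(self, size):
        self.values = deque(maxlen=size)  # oldest value is dropped automatically

    def update(self, value):
        self.values.append(value)

    def aggregate(self, fn=median):
        # Non-incremental aggregate: reads every value in the window.
        return fn(self.values)


class TimeWindow:
    """Time-based policy: the window holds values newer than `newest - span`."""

    def __init__(self, span):
        self.span = span
        self.items = deque()  # (timestamp, value), oldest first

    def update(self, timestamp, value):
        self.items.append((timestamp, value))
        # Evict values that have fallen out of the time span.
        while self.items and self.items[0][0] <= timestamp - self.span:
            self.items.popleft()

    def aggregate(self, fn=median):
        return fn(v for _, v in self.items)
```

In both cases each update touches the window once, while each aggregation reads the whole window, which is where the memory pressure discussed in the abstract comes from.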
The later part of this thesis focuses on alleviating the memory pressure due to the various steps in single-window stream aggregation. Updating the window with new incoming values and reading it to feed the aggregation functions are the two primary steps in stream aggregation. The high on-chip SRAM bandwidth enables line-rate processing, but only for small problem sizes due to the limited capacity. The larger off-chip DRAM size supports larger problems, but falls short on performance due to lower bandwidth. In order to bridge this gap, this thesis introduces a specialized memory hierarchy for stream aggregation. It employs Multi-Level Queues (MLQs) spanning across multiple memory levels with different characteristics to offer both high bandwidth and capacity. In doing so, larger stream aggregation problems can be supported at line-rate performance, outperforming existing competing solutions. Compared to designs with only on-chip memory, our approach supports 4 orders of magnitude larger problems. Compared to designs that use only DRAM, our design achieves up to 8x higher throughput.
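To give a feel for a queue that spans memory levels, here is a minimal software sketch under stated assumptions. It is not the MLQ design from the thesis: the class `TwoLevelQueue`, the `burst` parameter, and the policy of moving data between a small fast level and a large slow level in whole bursts are hypothetical choices made for illustration.

```python
from collections import deque


class TwoLevelQueue:
    """FIFO window split across a small fast level and a large slow level.

    Assumption for illustration: the slow level is only accessed in whole
    bursts of `burst` values, so its access count stays low.
    """

    def __init__(self, burst):
        self.burst = burst
        self.head = []         # fast level: newest values, not yet spilled
        self.slow = deque()    # slow level: full bursts, oldest burst first
        self.tail = deque()    # fast level: oldest values, ready to leave
        self.slow_accesses = 0

    def push(self, value):
        self.head.append(value)
        if len(self.head) == self.burst:
            self.slow.append(self.head)   # one burst write to the slow level
            self.head = []
            self.slow_accesses += 1

    def pop(self):
        if not self.tail:
            if self.slow:
                self.tail.extend(self.slow.popleft())  # one burst read
                self.slow_accesses += 1
            else:
                # Nothing spilled yet: serve directly from the fast level.
                self.tail.extend(self.head)
                self.head = []
        return self.tail.popleft()

    def __len__(self):
        return len(self.head) + self.burst * len(self.slow) + len(self.tail)
```

With a burst of 4, pushing and then popping 10 values costs 4 slow-level accesses in this sketch instead of 20, which is the kind of trade-off between capacity and bandwidth that the paragraph above describes.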
Finally, this thesis aims to alleviate the memory pressure due to the window-aggregation step. Although window updates can be supported efficiently using MLQs, frequent window aggregations remain a performance bottleneck. This thesis addresses this problem by introducing StreamZip, a dataflow stream aggregation engine that is able to compress the sliding windows. StreamZip deals with a number of data and control dependency challenges to integrate a compressor in the stream aggregation pipeline and alleviate the memory pressure posed by frequent aggregations. In doing so, StreamZip offers higher throughput as well as larger effective window capacity to support larger problems. StreamZip supports diverse compression algorithms, offering both lossless and lossy compression for fixed-point as well as floating-point numbers. Compared to designs using MLQs, StreamZip lossless and lossy designs achieve up to 7.5x and 22x higher throughput, while improving the effective memory capacity by up to 5x and 23x, respectively.
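Why compressing a window can raise its effective capacity can be seen with a small lossless example. The sketch below uses plain delta encoding as a stand-in; the abstract does not say which algorithms StreamZip uses, and the function names `delta_encode`, `delta_decode`, and `bits_needed` are hypothetical.

```python
def delta_encode(values):
    """Store the first value as is, then each value as a difference from its predecessor."""
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out = []
    total = 0
    for i, d in enumerate(deltas):
        total = d if i == 0 else total + d
        out.append(total)
    return out


def bits_needed(numbers):
    """Width of the widest number, counted as one sign bit plus magnitude bits."""
    return max((abs(n).bit_length() + 1 for n in numbers), default=0)
```

For slowly changing fixed-point data such as `[1000, 1002, 1001, 1005]`, the raw values need 11 bits each under this measure, while the differences after the first value need only 4, so more values fit in the same memory and fewer bits are read per aggregation.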

Stream

Reconfigurable Computing

Aggregation

Dataflow

FPGA

Memory Hierarchy

Compression

CSE EDIT 8103
Opponent: Dr. Dirk Koch, University of Manchester, UK

Author

Prajith Ramakrishnan Geethakumari

Chalmers, Computer Science and Engineering (Chalmers), Computer Engineering (Chalmers)

Single Window Stream Aggregation using Reconfigurable Hardware

2017 International Conference on Field Programmable Technology (ICFPT), 2017, p. 112-119

Paper in proceeding

Time-SWAD: A dataflow engine for time-based single window stream aggregation

Proceedings - 2019 International Conference on Field-Programmable Technology, ICFPT 2019, Vol. 2019-December, 2019, p. 72-80

Paper in proceeding

A Specialized Memory Hierarchy for Stream Aggregation

2021 31st International Conference on Field-Programmable Logic and Applications (FPL 2021), 2021, p. 204-210

Paper in proceeding

StreamZip: Compressed Sliding-Windows for Stream Aggregation

2021 International Conference on Field-Programmable Technology, ICFPT 2021, 2021, p. 203-211

Paper in proceeding

With recent technological advances, the number of connected devices is growing rapidly, along with the total amount of data they produce. Processing such big data brings tremendous opportunities in various domains (e.g., finance, transportation) by enabling real-time decisions. There is an insatiable need to summarise or aggregate this huge volume of data to make sophisticated decisions on the fly. The data stream model of computation involves answering such continuous aggregate queries over an infinite stream of data elements. Examples of such queries are average vehicle speeds per minute, or continuous user-behaviour statistics during online sessions. However, real-time analytics of such large data streams requires high processing throughput to cope with massive volumes of data, as well as low latency to respond in real time.

While the performance increase of general-purpose computing platforms is not able to keep up with the ever-increasing generation rates of real-life data streams, reconfigurable logic in FPGAs provides several unique strengths that enable faster solutions. When attached directly to the network, FPGAs can provide line-rate processing throughput with very low latency, because data no longer needs to traverse the various software layers of the networking and application stack as in general-purpose computers. With direct memory connectivity and custom accelerators mapped onto the reconfigurable fabric of the FPGA, unnecessary data movement across the memory hierarchy can be further minimised. Moreover, computations in FPGAs tend to be more energy-efficient because of the reduced data movement and because FPGAs operate at relatively low clock frequencies. This thesis aims to utilize these strengths of FPGAs and proposes novel accelerators and memory management techniques for stream aggregation using reconfigurable hardware.

A Novel, Comprehensible, Ultra-Fast, Security-Aware CPS Simulator (COSSIM)

European Commission (EC) (EC/H2020/644042), 2014-01-01 -- 2018-12-31.

ScalaNetS: Skalbara nätverks- och dataströmsberäkningar

Swedish Research Council (VR) (Dnr2016-05231), 2017-01-01 -- 2020-12-31.

Areas of Advance

Information and Communication Technology

Subject Categories (SSIF 2011)

Electrical Engineering, Electronic Engineering, Information Engineering

ISBN

978-91-7905-610-0

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5076

Publisher

Chalmers

Online

Latest update

3/2/2022 3