A Specialized Memory Hierarchy for Stream Aggregation
Paper in proceeding, 2021
High throughput stream aggregation is essential for many applications that analyze massive volumes of data. Incoming data need to be stored in a sliding window before processing, in case the aggregation functions cannot be computed incrementally. However, this puts tremendous pressure on the memory bandwidth and capacity. GPU and CPU memory management is inefficient for this task as it introduces unnecessary data movement that wastes bandwidth. FPGAs can make more efficient use of their memory but existing approaches employ either only on-chip memory (i.e. SRAM) or only off-chip memory (i.e. DRAM) to store the aggregated values. The high on-chip SRAM bandwidth enables line-rate processing, but only for small problem sizes due to the limited capacity. The larger off-chip DRAM size supports larger problems, but falls short on performance due to lower bandwidth. This paper introduces a specialized memory hierarchy for stream aggregation. It employs multiple memory levels with different characteristics to offer both high bandwidth and capacity. In doing so, larger stream aggregation problems can be supported, i.e. large number of concurrently active keys and large sliding windows, at line-rate performance. A 3-level implementation of the proposed memory hierarchy is used in a reconfigurable stream aggregation dataflow engine (DFE), outperforming existing competing solutions. Compared to designs with only on-chip memory, our approach supports 4 orders of magnitude larger problems. Compared to designs that use only DRAM, our design achieves up to 8x higher throughput.