S-RASTER: contraction clustering for evolving data streams
Journal article, 2020

Contraction Clustering (RASTER) is a single-pass algorithm for density-based clustering of 2D data. It can process arbitrary amounts of data in linear time and constant memory, quickly identifying approximate clusters. It also scales well across multiple CPU cores. RASTER performs very competitively against standard clustering algorithms, but at the cost of decreased precision. However, RASTER is limited to batch processing and cannot identify clusters that exist only temporarily. In contrast, S-RASTER is an adaptation of RASTER to the stream processing paradigm that can identify clusters in evolving data streams. It retains the main benefits of its parent algorithm, i.e. single-pass linear time cost and constant memory requirements for each discrete time step within a sliding window. The sliding window is pruned efficiently, and clustering is still performed in linear time. Like RASTER, S-RASTER trades an often negligible amount of precision for speed. Our evaluation shows that competing algorithms are at least 50% slower. Furthermore, S-RASTER shows good qualitative results based on standard metrics. It is very well suited to real-world scenarios where clustering happens only periodically rather than continually.
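The abstract does not spell out the mechanics, so the following is only a minimal sketch of one way a grid-based contraction step and a pruned sliding window could fit together. It assumes points are projected onto tiles by flooring scaled coordinates, that a tile is "significant" once it holds at least `threshold` points, and that adjacent significant tiles form a cluster. All function names and parameters (`contract`, `significant_tiles`, `cluster_tiles`, `sliding_window_clusters`, `precision`, `threshold`, `min_size`, `window`) are hypothetical and are not taken from the authors' implementation.

```python
import math
from collections import Counter, deque


def contract(points, precision=1):
    """Count points per grid tile; a tile is the floored, scaled coordinate pair."""
    scale = 10 ** precision
    return Counter(
        (math.floor(x * scale), math.floor(y * scale)) for x, y in points
    )


def significant_tiles(counts, threshold):
    """Keep only tiles that hold at least `threshold` points (assumed criterion)."""
    return {tile for tile, n in counts.items() if n >= threshold}


def cluster_tiles(tiles, min_size):
    """Group 8-connected tiles; drop groups with fewer than `min_size` tiles."""
    remaining = set(tiles)
    clusters = []
    while remaining:
        start = remaining.pop()
        cluster = {start}
        queue = deque([start])
        while queue:
            x, y = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbour = (x + dx, y + dy)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        cluster.add(neighbour)
                        queue.append(neighbour)
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters


def sliding_window_clusters(batches, window, threshold, min_size, precision=1):
    """Yield the clusters for each time step, using the `window` latest batches."""
    counts = Counter()   # tile counts for everything currently in the window
    recent = deque()     # per-batch tile counts, oldest first
    for batch in batches:
        batch_counts = contract(batch, precision)
        counts.update(batch_counts)
        recent.append(batch_counts)
        if len(recent) > window:
            # Prune the expired batch so its tiles no longer contribute.
            for tile, n in recent.popleft().items():
                counts[tile] -= n
                if counts[tile] <= 0:
                    del counts[tile]
        yield cluster_tiles(significant_tiles(counts, threshold), min_size)
```

In this sketch each batch is touched once when it enters the window and once when it leaves, which is one way to keep the per-step cost proportional to the batch size. Whether S-RASTER maintains its window in exactly this manner is not stated in the abstract.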

Clustering

Big data analytics

Unsupervised learning

Big data

Machine learning

Stream processing

Authors

Gregor Ulm

Stiftelsen Fraunhofer-Chalmers Centrum för Industrimatematik

Fraunhofer Center for Machine Learning

Simon Smith

Fraunhofer Center for Machine Learning

Stiftelsen Fraunhofer-Chalmers Centrum för Industrimatematik

Adrian Nilsson

Fraunhofer Center for Machine Learning

Stiftelsen Fraunhofer-Chalmers Centrum för Industrimatematik

Emil Gustavsson

Stiftelsen Fraunhofer-Chalmers Centrum för Industrimatematik

Fraunhofer Center for Machine Learning

Mats Jirstrand

Fraunhofer Center for Machine Learning

Stiftelsen Fraunhofer-Chalmers Centrum för Industrimatematik

Journal of Big Data

2196-1115 (eISSN)

Vol. 7, Issue 1, Article 62

Subject categories

Computer Engineering

Information Systems

Computer Systems

DOI

10.1186/s40537-020-00336-3

More information

Last updated

2021-03-23