Clustering in the Big Data Era: methods for efficient approximation, distribution, and parallelization
Datasets in modern IoT-based applications increasingly challenge conventional clustering schemes. Big Data is a term used loosely to describe hard-to-manage datasets. In particular, large numbers of data points, high rates of data production, high dimensionality, high skewness, and distributed data sources all challenge classical data processing schemes, including clustering methods.
This thesis contributes to efficient big data clustering for distributed and parallel computing architectures, which are representative of the processing environments in the edge-cloud computing continuum. The thesis also proposes approximation techniques to cope with certain challenging aspects of big data.
Regarding distributed clustering, the thesis proposes MAD-C, short for Multi-stage Approximate Distributed Cluster-Combining. MAD-C leverages an approximation-based data synopsis that drastically lowers the communication bandwidth required among the distributed nodes and achieves multiplicative savings in computation time compared to a baseline that gathers and clusters the data centrally. The thesis shows that MAD-C can detect and localize objects with high accuracy using data from distributed 3D laser-based sensors. Furthermore, the thesis shows how to use MAD-C to efficiently detect objects within a restricted area for geofencing purposes.
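The idea of exchanging compact synopses instead of raw points can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the synopsis or combining rule used by MAD-C: each node is assumed to summarize a local cluster as an axis-aligned bounding box (the hypothetical `BoxSynopsis`), and a combiner merges synopses whose boxes overlap using a union-find structure. Only a few numbers per cluster cross the network, regardless of how many points the cluster contains.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass(frozen=True)
class BoxSynopsis:
    """Hypothetical synopsis: bounding box plus point count of one local cluster."""
    lo: Point
    hi: Point
    count: int


def summarize(cluster: Sequence[Point]) -> BoxSynopsis:
    """Compress a non-empty local cluster into a constant-size synopsis."""
    dims = range(len(cluster[0]))
    lo = tuple(min(p[d] for p in cluster) for d in dims)
    hi = tuple(max(p[d] for p in cluster) for d in dims)
    return BoxSynopsis(lo, hi, len(cluster))


def overlaps(a: BoxSynopsis, b: BoxSynopsis, slack: float = 0.0) -> bool:
    """True if the two boxes intersect in every dimension (within `slack`)."""
    return all(
        a.lo[d] - slack <= b.hi[d] and b.lo[d] - slack <= a.hi[d]
        for d in range(len(a.lo))
    )


def combine(synopses: Sequence[BoxSynopsis], slack: float = 0.0) -> List[List[int]]:
    """Group indices of synopses that (transitively) overlap, via union-find."""
    parent = list(range(len(synopses)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(synopses)):
        for j in range(i + 1, len(synopses)):
            if overlaps(synopses[i], synopses[j], slack):
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(synopses)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Node A reports one local cluster; node B reports two.
node_a = [summarize([(0.0, 0.0), (1.0, 1.0)])]
node_b = [
    summarize([(0.5, 0.5), (2.0, 2.0)]),
    summarize([(10.0, 10.0), (11.0, 11.0)]),
]
print(combine(node_a + node_b))  # [[0, 1], [2]]
```

In this sketch the first cluster of node B is merged with the cluster of node A because their boxes overlap, while the distant cluster stays separate. A bounding box is a coarse approximation; a tighter synopsis would reduce false merges at the cost of more data per cluster.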
Regarding parallel clustering, the thesis proposes a family of algorithms called PARMA-CC, short for Parallel Multiphase Approximate Cluster Combining. Using an approximation-based data synopsis, PARMA-CC algorithms achieve scalability on multi-core systems by enabling parallel execution of threads with limited dependencies, which are resolved using fine-grained synchronization techniques. To further improve efficiency, PARMA-CC algorithms can be configured for different data properties. Analytical and empirical evaluations show that PARMA-CC algorithms achieve significantly higher scalability than state-of-the-art methods while preserving high accuracy.
Regarding parallel high-dimensional clustering, the thesis proposes IP.LSH.DBSCAN, short for Integrated Parallel Density-Based Clustering through Locality-Sensitive Hashing (LSH). IP.LSH.DBSCAN fuses the creation of an LSH index into the clustering process itself, and it takes advantage of data parallelization and fine-grained synchronization. Analytical and empirical evaluations show that IP.LSH.DBSCAN enables parallel density-based clustering of massive datasets under the desired distance measure, with latency several orders of magnitude lower than state-of-the-art methods for high-dimensional data.
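How an LSH index can stand in for exact neighborhood queries in density-based clustering can be sketched as follows. This is a sequential, simplified illustration under stated assumptions and not the IP.LSH.DBSCAN algorithm: it assumes Euclidean distance, uses the standard random-projection hash family h(x) = floor((a·x + b) / w), and only performs the core-point test of DBSCAN. The function names and the parameters `num_tables`, `hashes_per_table`, and `width` are hypothetical.

```python
import math
import random
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Set, Tuple

Point = Tuple[float, ...]
Table = Dict[Tuple[int, ...], List[int]]


def make_hash(dim: int, width: float, rng: random.Random) -> Callable[[Point], int]:
    """One random-projection hash: nearby points tend to share a value."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)

    def h(x: Point) -> int:
        return math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / width)

    return h


def build_tables(points: Sequence[Point], num_tables: int,
                 hashes_per_table: int, width: float, seed: int = 0) -> List[Table]:
    """Hash every point into one bucket per table."""
    rng = random.Random(seed)
    dim = len(points[0])
    tables: List[Table] = []
    for _ in range(num_tables):
        hashes = [make_hash(dim, width, rng) for _ in range(hashes_per_table)]
        buckets: Table = defaultdict(list)
        for idx, p in enumerate(points):
            buckets[tuple(h(p) for h in hashes)].append(idx)
        tables.append(buckets)
    return tables


def core_points(points: Sequence[Point], tables: Sequence[Table],
                eps: float, min_pts: int) -> Set[int]:
    """Approximate DBSCAN core test: only bucket-mates are distance-checked.

    Candidates are verified with the exact distance, so no point is wrongly
    reported as a neighbor; true neighbors that never share a bucket can be
    missed, which is the source of the approximation.
    """
    neighbors: List[Set[int]] = [set() for _ in points]
    for buckets in tables:
        for members in buckets.values():
            for i in members:
                for j in members:
                    if math.dist(points[i], points[j]) <= eps:
                        neighbors[i].add(j)
    return {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}


pts = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (100.0, 100.0)]
tabs = build_tables(pts, num_tables=4, hashes_per_table=2, width=1.0)
print(core_points(pts, tabs, eps=0.5, min_pts=3))  # {0, 1, 2}
```

Using more tables raises the chance that true neighbors collide in at least one bucket, while using more hashes per table makes buckets smaller and distance checks cheaper; the two parameters trade accuracy against work.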
In essence, the thesis proposes methods and algorithmic implementations that target big data clustering and its applications using distributed and parallel processing. The proposed methods (available as open-source software) are extensible and can be used in combination with other methods.
Distributed and Parallel Processing
Networks and Systems
MAD-C: Multi-stage Approximate Distributed Cluster-combining for obstacle detection and localization
Journal of Parallel and Distributed Computing, Vol. 147 (2021), p. 248-267
Journal article
MAD-C: Multi-stage Approximate Distributed Cluster-Combining for Obstacle Detection and Localization
Lecture Notes in Computer Science, Vol. 11339 (2019), p. 312-324
Paper in proceedings
Amir Keramatian, Vincenzo Gulisano, Marina Papatriantafilou, Philippas Tsigas, “PARMA-CC: A Family of Parallel Multiphase Approximate Cluster Combining Algorithms”, the Journal of Parallel and Distributed Computing (JPDC), Under Review After Minor Revision, Elsevier, 2022.
PARMA-CC: Parallel Multiphase Approximate Cluster Combining
ICDCN 2020: Proceedings of the 21st International Conference on Distributed Computing and Networking (2020)
Paper in proceedings
Amir Keramatian, Vincenzo Gulisano, Marina Papatriantafilou, Philippas Tsigas, “IP.LSH.DBSCAN: Integrated Parallel Density-Based Clustering through Locality-Sensitive Hashing”, Under Review.
Cloud-based products and production (FiC)
Swedish Foundation for Strategic Research (SSF) (GMT14-0032), 2016-01-01 -- 2020-12-31.
Information and Communication Technology
Innovation and Entrepreneurship
Doctoral theses at Chalmers University of Technology. New series: 5116
EB Hörsal, EDIT building, Hörsalsvägen 11
Opponent: Associate Professor Ioannis Chatzigiannakis