Clustering in the Big Data Era: methods for efficient approximation, distribution, and parallelization
Doctoral thesis, 2022
Datasets in modern IoT-based applications are increasingly challenging for conventional clustering schemes. Big data is a term used loosely to describe such hard-to-manage datasets. In particular, large numbers of data points, high rates of data production, high dimensionality, high skewness, and distributed data sources challenge classical data processing schemes, including clustering methods.
This thesis contributes to efficient big data clustering for distributed and parallel computing architectures, representative of the processing environments in the edge-cloud computing continuum. The thesis also proposes approximation techniques to cope with certain challenging aspects of big data.
Regarding distributed clustering, the thesis proposes MAD-C, abbreviating Multi-stage Approximate Distributed Cluster-Combining. MAD-C leverages an approximation-based data synopsis that drastically lowers the required communication bandwidth among the distributed nodes and achieves multiplicative savings in computation time, compared to a baseline that centrally gathers and clusters the data. The thesis shows MAD-C can be used to detect and localize objects using data from distributed 3D laser-based sensors with high accuracy. Furthermore, the work in the thesis shows how to utilize MAD-C to efficiently detect the objects within a restricted area for geofencing purposes.
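The idea of exchanging compact cluster synopses instead of raw points can be illustrated with a minimal sketch. The synopsis used here (centroid, covering radius, point count) and the overlap test are assumptions made for illustration, and the names `Synopsis`, `summarize`, and `combine` are hypothetical; the synopsis actually used by MAD-C may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class Synopsis:
    """Hypothetical compact summary of one locally detected cluster."""
    center: tuple
    radius: float
    count: int


def summarize(points):
    """Compress one local cluster into a centroid, a covering radius, and a size."""
    n = len(points)
    dim = len(points[0])
    center = tuple(sum(p[d] for p in points) / n for d in range(dim))
    radius = max(math.dist(center, p) for p in points)
    return Synopsis(center, radius, n)


def combine(synopses, slack=0.0):
    """Group synopses whose covering spheres overlap (within `slack`).

    Returns a sorted list of index groups; each group is one combined cluster.
    """
    parent = list(range(len(synopses)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(synopses)):
        for j in range(i + 1, len(synopses)):
            a, b = synopses[i], synopses[j]
            if math.dist(a.center, b.center) <= a.radius + b.radius + slack:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(synopses)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two nodes, each holding two local clusters; only the synopses are exchanged.
node_a = [[(0, 0), (1, 0), (0, 1), (1, 1)], [(10, 10), (11, 10)]]
node_b = [[(1, 1), (2, 1)], [(30, 30), (31, 31)]]
synopses = [summarize(c) for c in node_a + node_b]
combined = combine(synopses)
```

In this toy run, `combined` is `[[0, 2], [1], [3]]`: the first cluster of each node is recognized as one object, while only four small records cross the network instead of ten points.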
Regarding parallel clustering, the thesis proposes a family of algorithms called PARMA-CC, abbreviating Parallel Multistage Approximate Cluster Combining. Using approximation-based data synopses, PARMA-CC algorithms achieve scalability on multi-core systems by facilitating parallel execution of threads with limited dependencies, which are resolved using fine-grained synchronization techniques. To further enhance efficiency, PARMA-CC algorithms can be configured with respect to different data properties. Analytical and empirical evaluations show that PARMA-CC algorithms achieve significantly higher scalability than state-of-the-art methods while preserving high accuracy.
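The pattern of threads clustering their own partitions and then combining the results through a shared structure can be sketched as follows. This is a toy under stated assumptions, not PARMA-CC itself: the data is one-dimensional, clusters are summarized as intervals, and a single lock protects the shared summary, which is coarser than the fine-grained synchronization described above. The names `local_clusters`, `SharedSummary`, and `parallel_cluster` are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def local_clusters(values, eps):
    """Gap-based 1-D clustering: consecutive values within eps share a cluster."""
    intervals = []
    for v in sorted(values):
        if intervals and v - intervals[-1][1] <= eps:
            intervals[-1][1] = v          # extend the current interval
        else:
            intervals.append([v, v])      # start a new interval
    return intervals


class SharedSummary:
    """Interval summary shared by all threads (assumption: one coarse lock)."""

    def __init__(self, eps):
        self.eps = eps
        self.intervals = []
        self._lock = threading.Lock()

    def absorb(self, lo, hi):
        """Merge [lo, hi] with every stored interval that lies within eps of it."""
        with self._lock:
            kept = []
            for a, b in self.intervals:
                if a - hi <= self.eps and lo - b <= self.eps:
                    lo, hi = min(lo, a), max(hi, b)
                else:
                    kept.append((a, b))
            kept.append((lo, hi))
            self.intervals = sorted(kept)


def parallel_cluster(chunks, eps, workers=4):
    """Cluster each chunk in its own task, then combine via the shared summary."""
    summary = SharedSummary(eps)

    def work(chunk):
        for lo, hi in local_clusters(chunk, eps):
            summary.absorb(lo, hi)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, chunks))      # consume to surface worker exceptions
    return summary.intervals


chunks = [[1.0, 1.4, 9.0], [1.8, 9.3, 20.0], [2.1, 20.2]]
result = parallel_cluster(chunks, eps=0.5)
```

The threads depend on each other only while touching the shared summary, and the combined result does not depend on the order in which they finish.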
On parallel high-dimensional clustering, the thesis proposes IP.LSH.DBSCAN, abbreviating Integrated Parallel Density-Based Clustering through Locality-Sensitive Hashing (LSH). IP.LSH.DBSCAN fuses the process of creating an LSH index into the process of data clustering, and it takes advantage of data parallelization and fine-grained synchronization. Analytical and empirical evaluations show that IP.LSH.DBSCAN facilitates parallel density-based clustering of massive datasets using desired distance measures, resulting in several orders of magnitude lower latency than the state of the art for high-dimensional data.
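How an LSH index narrows the neighbor search that density-based clustering relies on can be sketched as below. The sketch is sequential and covers only candidate lookup; it does not reproduce the integrated, parallel construction of IP.LSH.DBSCAN. The hash family (random hyperplanes for angular distance), the parameter values, and all function names are assumptions chosen for illustration, and vectors are assumed to be non-zero.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(num_tables, bits, dim, seed=0):
    """Draw `bits` random hyperplanes through the origin for each hash table."""
    rng = random.Random(seed)
    return [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
            for _ in range(num_tables)]


def signature(planes, v):
    """Hash a vector to the tuple of sides it falls on for each hyperplane."""
    return tuple(sum(a * b for a, b in zip(p, v)) >= 0 for p in planes)


def build_index(points, tables):
    """Place every point index into one bucket per hash table."""
    index = [defaultdict(list) for _ in tables]
    for i, v in enumerate(points):
        for t, planes in enumerate(tables):
            index[t][signature(planes, v)].append(i)
    return index


def angular_distance(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def neighbors(i, points, tables, index, eps):
    """Approximate eps-neighborhood: check only points sharing a bucket with i."""
    candidates = set()
    for t, planes in enumerate(tables):
        candidates.update(index[t][signature(planes, points[i])])
    return sorted(j for j in candidates
                  if angular_distance(points[i], points[j]) <= eps)


points = [(1.0, 0.0), (1.0, 0.01), (0.0, 1.0), (-1.0, 0.0)]
tables = make_hyperplanes(num_tables=4, bits=3, dim=2, seed=1)
index = build_index(points, tables)
near_first = neighbors(0, points, tables, index, eps=0.1)
```

Because candidates are verified against the true distance, the lookup never returns a false neighbor; it may miss a true neighbor that shares no bucket with the query, which is the accuracy traded for speed. Using more tables lowers that risk at the cost of more memory and hashing work.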
In essence, the thesis proposes methods and algorithmic implementations targeting the problem of big data clustering and applications using distributed and parallel processing. The proposed methods (available as open source software) are extensible and can be used in combination with other methods.
Clustering
Approximation-based synopsis
Distributed and Parallel Processing
Applied ML
Author
Amir Keramatian
Network and Systems
MAD-C: Multi-stage Approximate Distributed Cluster-combining for obstacle detection and localization
Journal of Parallel and Distributed Computing, Vol. 147 (2021), p. 248-267
Journal article
MAD-C: Multi-stage Approximate Distributed Cluster-Combining for Obstacle Detection and Localization
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11339 (2019), p. 312-324
Paper in proceeding
PARMA-CC: A Family of Parallel Multiphase Approximate Cluster Combining Algorithms
Journal of Parallel and Distributed Computing, Vol. 177 (2023), p. 68-88
Journal article
PARMA-CC: Parallel Multiphase Approximate Cluster Combining
ACM International Conference Proceeding Series (2020)
Paper in proceeding
IP.LSH.DBSCAN: Integrated Parallel Density-Based Clustering Through Locality-Sensitive Hashing
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 13440 LNCS (2022), p. 268-284
Paper in proceeding
The rapid growth in the size of big data has rendered many popular data clustering algorithms impractical. Certain clustering algorithms might take days or even months to process big datasets containing millions or billions of observations.
The thesis contributes to big data clustering by proposing efficient algorithms that can cope with big data challenges. To dramatically lower the processing time, the proposed algorithms utilize approximation and leverage distributed and parallel processing techniques. The results show that the proposed algorithms are not only orders of magnitude faster than existing state-of-the-art algorithms but also highly accurate. The proposed algorithms can be utilized in production environments, as they can efficiently process data from LiDAR, GPS, and other sensors. For example, the work in the thesis can be used to efficiently detect objects within arbitrary boundaries in an environment scanned by LiDAR sensors.
Future factories in the Cloud (FiC)
Swedish Foundation for Strategic Research (SSF) (GMT14-0032), 2016-01-01 -- 2020-12-31.
Subject Categories
Computer Engineering
Computational Mathematics
Transport Systems and Logistics
Computer Science
Computer Systems
Areas of Advance
Information and Communication Technology
Driving Forces
Sustainable development
Innovation and entrepreneurship
ISBN
978-91-7905-650-6
Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5116
Publisher
Chalmers
EB Hörsal, EDIT building, Hörsalsvägen 11
Opponent: Associate Professor Ioannis Chatzigiannakis