Clustering in the Big Data Era: methods for efficient approximation, distribution, and parallelization

Amir Keramatian

Doctoral thesis, 2022

Data clustering is an unsupervised machine learning task whose objective is to group together similar items. As a versatile data mining tool, data clustering has numerous applications, such as object detection and localization using data from 3D laser-based sensors, finding popular routes using geolocation data, and finding similar patterns of electricity consumption using smart meters.

The datasets in modern IoT-based applications are increasingly challenging for conventional clustering schemes. Big Data is a term loosely used to describe such hard-to-manage datasets. In particular, large numbers of data points, high rates of data production, large numbers of dimensions, high skewness, and distributed data sources all challenge classical data processing schemes, including clustering methods.

This thesis contributes to efficient big data clustering for distributed and parallel computing architectures, which are representative of the processing environments in the edge-cloud computing continuum. The thesis also proposes approximation techniques to cope with certain challenging aspects of big data.

Regarding distributed clustering, the thesis proposes MAD-C, short for Multi-stage Approximate Distributed Cluster-Combining. MAD-C leverages an approximation-based data synopsis that drastically lowers the communication bandwidth required among the distributed nodes and achieves multiplicative savings in computation time compared to a baseline that centrally gathers and clusters the data. The thesis shows that MAD-C can detect and localize objects with high accuracy using data from distributed 3D laser-based sensors. Furthermore, the thesis shows how to use MAD-C to efficiently detect the objects within a restricted area for geofencing purposes.
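The cluster-combining idea can be illustrated with a minimal sketch. It rests on assumptions that are not stated in the abstract: each node summarizes every local cluster as a bounding sphere (centroid and radius), only these summaries are communicated, and a coordinator merges summaries whose spheres overlap. The sphere synopsis, the overlap rule, and the names `summarize` and `combine` are hypothetical and chosen for illustration; they should not be read as the actual MAD-C design.

```python
import math

def summarize(cluster):
    """Hypothetical synopsis: bounding sphere (centroid, radius) of one local cluster."""
    dim = len(cluster[0])
    centroid = tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dim))
    radius = max(math.dist(centroid, p) for p in cluster)
    return centroid, radius

def combine(summaries):
    """Merge summaries whose spheres overlap; returns one group label per summary."""
    parent = list(range(len(summaries)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (ci, ri) in enumerate(summaries):
        for j in range(i + 1, len(summaries)):
            cj, rj = summaries[j]
            if math.dist(ci, cj) <= ri + rj:  # spheres touch or overlap
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(summaries))]

# Two nodes see the same object from different sides; node_b also sees a distant one.
node_a = [[(0.0, 0.0), (2.0, 0.0)]]
node_b = [[(1.5, 0.0), (3.5, 0.0)], [(10.0, 10.0), (11.0, 10.0)]]

# Only the summaries (not the raw points) would travel between nodes.
summaries = [summarize(c) for node in (node_a, node_b) for c in node]
labels = combine(summaries)
```

In this sketch the first two summaries receive the same label and the third stays separate, while each node transmits one centroid and one radius per cluster instead of its raw points.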

Regarding parallel clustering, the thesis proposes a family of algorithms called PARMA-CC, short for Parallel Multiphase Approximate Cluster Combining. Using an approximation-based data synopsis, PARMA-CC algorithms scale on multi-core systems by running threads in parallel with limited dependencies, which are resolved using fine-grained synchronization techniques. To further improve efficiency, PARMA-CC algorithms can be configured according to different data properties. Analytical and empirical evaluations show that PARMA-CC algorithms achieve significantly higher scalability than state-of-the-art methods while preserving high accuracy.
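The pattern of parallel threads with limited dependencies and fine-grained synchronization can be sketched as follows. This is not the PARMA-CC algorithm: it omits the approximation-based synopsis, works on one-dimensional points only, and the class `ConcurrentUnionFind` and the functions `cluster_chunk` and `parallel_cluster` are hypothetical names. Each thread links nearby points inside its own chunk; the only cross-thread dependency is the single pair of points that straddles a chunk boundary, and it is resolved through per-element locks rather than one global lock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ConcurrentUnionFind:
    """Union-find with one lock per element (illustrative assumption)."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.locks = [threading.Lock() for _ in range(n)]

    def find(self, i):
        # Lock-free read; roots are re-validated under locks in union().
        while self.parent[i] != i:
            i = self.parent[i]
        return i

    def union(self, a, b):
        while True:
            ra, rb = self.find(a), self.find(b)
            if ra == rb:
                return
            lo, hi = min(ra, rb), max(ra, rb)
            # Acquire the two locks in index order to avoid deadlock.
            with self.locks[lo], self.locks[hi]:
                if self.parent[lo] == lo and self.parent[hi] == hi:
                    self.parent[hi] = lo  # always link the larger root to the smaller
                    return
            # A root changed concurrently; retry with fresh roots.

def cluster_chunk(uf, points, start, stop, eps):
    """Link consecutive sorted points at most eps apart, including the pair crossing into the next chunk."""
    end = min(stop, len(points) - 1)
    for i in range(start, end):
        if points[i + 1] - points[i] <= eps:
            uf.union(i, i + 1)

def parallel_cluster(points, eps, workers=4):
    points = sorted(points)
    uf = ConcurrentUnionFind(len(points))
    size = max(1, -(-len(points) // workers))  # ceiling division
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(points), size):
            pool.submit(cluster_chunk, uf, points, start, start + size, eps)
    # Leaving the with-block waits for all submitted chunks to finish.
    return points, [uf.find(i) for i in range(len(points))]

points, labels = parallel_cluster([0.0, 0.4, 0.8, 5.0, 5.3, 9.0, 1.1, 5.6], eps=0.5)
```

Points that share a label belong to the same cluster. Under the standard CPython interpreter with its global interpreter lock, the threads demonstrate the synchronization pattern rather than an actual speedup.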

For parallel high-dimensional clustering, the thesis proposes IP.LSH.DBSCAN, short for Integrated Parallel Density-Based Clustering through Locality-Sensitive Hashing (LSH). IP.LSH.DBSCAN fuses the creation of an LSH index into the clustering process itself, and it takes advantage of data parallelization and fine-grained synchronization. Analytical and empirical evaluations show that IP.LSH.DBSCAN enables parallel density-based clustering of massive datasets under the desired distance measures, with several orders of magnitude lower latency than the state of the art for high-dimensional data.
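How LSH can supply neighbor candidates for density-based clustering is shown in the sequential sketch below. It is a simplified illustration, not IP.LSH.DBSCAN itself: it builds the hash tables first and clusters afterwards, so it reproduces neither the integration of index creation into clustering nor the parallelism. The function name `lsh_dbscan`, the random-projection hash for Euclidean distance, and every parameter default (including the bucket width of `4 * eps`) are assumptions made for this example.

```python
import math
import random
from collections import defaultdict

def lsh_dbscan(points, eps, min_pts, tables=8, hashes=2, seed=0):
    """DBSCAN-style clustering whose neighbor candidates come from LSH buckets."""
    rng = random.Random(seed)
    dim = len(points[0])
    width = 4 * eps  # bucket width; an arbitrary choice for this sketch
    candidates = [set() for _ in points]

    for _ in range(tables):
        # Each table concatenates several random-projection hashes.
        projs = [([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, width))
                 for _ in range(hashes)]
        buckets = defaultdict(list)
        for idx, p in enumerate(points):
            key = tuple(math.floor((sum(a * x for a, x in zip(vec, p)) + off) / width)
                        for vec, off in projs)
            buckets[key].append(idx)
        for members in buckets.values():
            for i in members:
                candidates[i].update(members)

    # Verify candidates with the exact distance; LSH may miss some true neighbors.
    neighbors = [[j for j in cand if math.dist(points[i], points[j]) <= eps]
                 for i, cand in enumerate(candidates)]
    core = [len(nb) >= min_pts for nb in neighbors]  # a point counts as its own neighbor

    labels = [-1] * len(points)  # -1 marks noise
    cluster = 0
    for i in range(len(points)):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:  # expand the cluster through core points only
            cur = stack.pop()
            for j in neighbors[cur]:
                if labels[j] == -1:
                    labels[j] = cluster
                    if core[j]:
                        stack.append(j)
        cluster += 1
    return labels

# Usage: two tight groups and one isolated point in 3D.
data = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
        (9.0, 9.0, 9.0), (9.1, 9.0, 9.0), (9.0, 9.1, 9.0),
        (50.0, 50.0, 50.0)]
result = lsh_dbscan(data, eps=0.5, min_pts=3)
```

Because only points that share a bucket are compared, the exact distance is computed for far fewer pairs than in a brute-force neighborhood search, at the cost of possibly missing some neighbors. Increasing `tables` reduces that risk in this sketch.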

In essence, the thesis proposes methods and algorithmic implementations that target big data clustering and its applications using distributed and parallel processing. The proposed methods, available as open-source software, are extensible and can be combined with other methods.

Clustering

Approximation-based synopsis

Distributed and Parallel Processing

Applied ML

EB Hörsal, EDIT building, Hörsalsvägen 11

Opponent: Associate Professor Ioannis Chatzigiannakis

Online defence

Author

Amir Keramatian

Network and Systems

MAD-C: Multi-stage Approximate Distributed Cluster-combining for obstacle detection and localization

Journal of Parallel and Distributed Computing, Vol. 147 (2021), p. 248-267

Journal article

MAD-C: Multi-stage Approximate Distributed Cluster-Combining for Obstacle Detection and Localization

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11339 (2019), p. 312-324

Paper in proceeding

PARMA-CC: A Family of Parallel Multiphase Approximate Cluster Combining Algorithms

Journal of Parallel and Distributed Computing, Vol. 177 (2023), p. 68-88

Journal article

PARMA-CC: Parallel Multiphase Approximate Cluster Combining

ACM International Conference Proceeding Series (2020)

Paper in proceeding

IP.LSH.DBSCAN: Integrated Parallel Density-Based Clustering Through Locality-Sensitive Hashing

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 13440 LNCS (2022), p. 268-284

Paper in proceeding

Data clustering, the task of discovering groups of similar observations, appears naturally in many fields, such as biology, zoology, sociology, geology, and engineering. Its versatility has made it a fundamental data mining tool with numerous applications. For example, given a corpus of news articles, data clustering can find articles on the same topic; given the readings from a 3D sensor (e.g., LiDAR), it can detect the objects in the sensor's surroundings.

The rapid growth in the size of datasets has rendered many popular data clustering algorithms impractical. Certain clustering algorithms might take days or even months to process big datasets containing millions or billions of observations.

The thesis contributes to big data clustering by proposing efficient algorithms that can cope with big data challenges. To dramatically lower the processing time, the proposed algorithms use approximation and leverage distributed and parallel processing techniques. The results show that the proposed algorithms are not only orders of magnitude faster than existing state-of-the-art algorithms but also highly accurate. The proposed algorithms can be used in production environments, as they can efficiently process data from LiDAR, GPS, and other sensors. For example, the work in the thesis can be used to efficiently detect the objects within arbitrary boundaries in an environment scanned by LiDAR sensors.

Future factories in the Cloud (FiC)

Swedish Foundation for Strategic Research (SSF) (GMT14-0032), 2016-01-01 -- 2020-12-31.

Subject Categories (SSIF 2011)

Computer Engineering

Computational Mathematics

Transport Systems and Logistics

Computer Science

Computer Systems

Areas of Advance

Information and Communication Technology

Driving Forces

Sustainable development

Innovation and entrepreneurship

ISBN

978-91-7905-650-6

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5116

Publisher

Chalmers

More information

Latest update

11/12/2023
