Clustering in the Big Data Era: methods for efficient approximation, distribution, and parallelization
Doctoral thesis, 2022

Data clustering is an unsupervised machine learning task whose objective is to group together similar items. As a versatile data mining tool, data clustering has numerous applications, such as object detection and localization using data from 3D laser-based sensors, finding popular routes using geolocation data, and finding similar patterns of electricity consumption using smart meters.

The datasets in modern IoT-based applications are becoming increasingly challenging for conventional clustering schemes. Big Data is a term loosely used to describe hard-to-manage datasets. In particular, large numbers of data points, high rates of data production, large numbers of dimensions, high skewness, and distributed data sources challenge classical data processing schemes, including clustering methods.

This thesis contributes to efficient big data clustering for distributed and parallel computing architectures, which are representative of the processing environments in the edge-cloud computing continuum. The thesis also proposes approximation techniques to cope with certain challenging aspects of big data.

Regarding distributed clustering, the thesis proposes MAD-C, short for Multi-stage Approximate Distributed Cluster-Combining. MAD-C leverages an approximation-based data synopsis that drastically lowers the communication bandwidth required among the distributed nodes and achieves multiplicative savings in computation time compared to a baseline that gathers and clusters the data centrally. The thesis shows that MAD-C can detect and localize objects with high accuracy using data from distributed 3D laser-based sensors. Furthermore, it shows how to use MAD-C to efficiently detect objects within a restricted area for geofencing purposes.
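The general idea of exchanging compact cluster summaries instead of raw points can be sketched as follows. This is a minimal illustration under stated assumptions, not the MAD-C algorithm itself: the `Summary` record (centroid, count, bounding radius), the single-linkage local clustering, and the greedy `combine` step are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Summary:
    """Hypothetical synopsis of one local cluster: a bounding sphere plus a count."""
    centroid: tuple
    count: int
    radius: float


def local_clusters(points, eps):
    """Group points on one node: points within eps of each other are linked."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())


def summarize(cluster):
    """Replace a cluster by its centroid, size, and bounding radius."""
    n = len(cluster)
    centroid = tuple(sum(axis) / n for axis in zip(*cluster))
    radius = max(dist(centroid, p) for p in cluster)
    return Summary(centroid, n, radius)


def combine(summaries, eps):
    """Greedy one-pass merge of summaries whose bounding spheres lie within eps.

    Assumption for brevity: a single pass, so chains of merges that only
    become possible after an earlier merge are not revisited.
    """
    merged = []
    for s in summaries:
        for k, m in enumerate(merged):
            if dist(s.centroid, m.centroid) <= s.radius + m.radius + eps:
                total = s.count + m.count
                centroid = tuple(
                    (a * s.count + b * m.count) / total
                    for a, b in zip(s.centroid, m.centroid)
                )
                radius = max(
                    dist(centroid, s.centroid) + s.radius,
                    dist(centroid, m.centroid) + m.radius,
                )
                merged[k] = Summary(centroid, total, radius)
                break
        else:
            merged.append(s)
    return merged
```

In this sketch each node would send only its `Summary` records to the combining step, so the amount of data exchanged grows with the number of local clusters rather than with the number of points.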

Regarding parallel clustering, the thesis proposes a family of algorithms called PARMA-CC, short for Parallel Multiphase Approximate Cluster Combining. Using an approximation-based data synopsis, PARMA-CC algorithms achieve scalability on multi-core systems by facilitating parallel execution of threads with limited dependencies, which are resolved using fine-grained synchronization techniques. To further improve efficiency, PARMA-CC algorithms can be configured with respect to different data properties. Analytical and empirical evaluations show that PARMA-CC algorithms achieve significantly higher scalability than state-of-the-art methods while preserving high accuracy.
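The structure of threads that work mostly independently and synchronize only when they discover a link between clusters can be illustrated with a small sketch. It is not PARMA-CC: it uses no data synopsis, it computes plain connected components of the eps-neighborhood graph, and it guards the shared state with one coarse lock rather than fine-grained synchronization. The names `SharedUnionFind` and `parallel_cluster` are hypothetical. In CPython the global interpreter lock also limits real speedup, so the example shows the shape of the computation only.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from math import dist


class SharedUnionFind:
    """Union-find shared by all worker threads (hypothetical helper)."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.lock = threading.Lock()  # single coarse lock, for brevity only

    def find(self, i):
        while self.parent[i] != i:
            i = self.parent[i]
        return i

    def union(self, i, j):
        # Threads synchronize only here, when two points must be linked.
        with self.lock:
            ri, rj = self.find(i), self.find(j)
            if ri != rj:
                # Link the larger root under the smaller one, so every
                # cluster ends up labeled by its smallest point index.
                self.parent[max(ri, rj)] = min(ri, rj)


def parallel_cluster(points, eps, workers=2):
    """Label each point with the smallest index in its eps-connected cluster."""
    uf = SharedUnionFind(len(points))

    def work(chunk):
        for i in chunk:
            for j in range(i + 1, len(points)):
                if dist(points[i], points[j]) <= eps:
                    uf.union(i, j)

    # Strided partition of the point indices across the workers.
    chunks = [range(w, len(points), workers) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, chunks))  # list() re-raises worker exceptions
    return [uf.find(i) for i in range(len(points))]
```

The labels do not depend on thread scheduling, because each cluster's root is always its smallest member index regardless of the order in which links are found.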

Regarding parallel high-dimensional clustering, the thesis proposes IP.LSH.DBSCAN, short for Integrated Parallel Density-Based Clustering through Locality-Sensitive Hashing (LSH). IP.LSH.DBSCAN fuses the creation of an LSH index into the clustering process and takes advantage of data parallelization and fine-grained synchronization. Analytical and empirical evaluations show that IP.LSH.DBSCAN facilitates parallel density-based clustering of massive datasets using desired distance measures, with several orders of magnitude lower latency than state-of-the-art methods for high-dimensional data.
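To show why an LSH index helps density-based clustering in high dimensions, the sketch below uses the standard random-projection hash for Euclidean distance, h(x) = floor((a·x + b) / w), to shortlist candidate neighbors before the exact distance check. It is an assumption-laden illustration rather than IP.LSH.DBSCAN: the index is built as a separate step instead of being fused into the clustering, there is no parallelism, and the class name `EuclideanLSH` and its parameters (`num_tables`, `hashes_per_table`, `width`) are hypothetical.

```python
import random
from collections import defaultdict
from math import dist, floor


class EuclideanLSH:
    """Hypothetical multi-table LSH index for Euclidean distance."""

    def __init__(self, dim, num_tables=4, hashes_per_table=2, width=1.0, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.tables = []
        for _ in range(num_tables):
            # Each hash function is a random Gaussian direction plus an offset.
            funcs = [
                ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
                for _ in range(hashes_per_table)
            ]
            self.tables.append((funcs, defaultdict(list)))

    def _key(self, funcs, x):
        return tuple(
            floor((sum(a * v for a, v in zip(vec, x)) + b) / self.width)
            for vec, b in funcs
        )

    def insert(self, idx, x):
        for funcs, buckets in self.tables:
            buckets[self._key(funcs, x)].append(idx)

    def candidates(self, x):
        """Indices sharing a bucket with x in at least one table."""
        found = set()
        for funcs, buckets in self.tables:
            found.update(buckets.get(self._key(funcs, x), ()))
        return found


def eps_neighbors(index, points, x, eps):
    """Approximate eps-neighborhood: exact distance check on LSH candidates only."""
    return sorted(i for i in index.candidates(x) if dist(points[i], x) <= eps)
```

The result is approximate in one direction only: every returned index is a true eps-neighbor, but a true neighbor that never shares a bucket with the query is missed. Using more tables lowers that miss rate at the cost of more candidates to check.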

In essence, the thesis proposes methods and algorithmic implementations targeting the problem of big data clustering and applications using distributed and parallel processing. The proposed methods (available as open source software) are extensible and can be used in combination with other methods.

Applied ML

Clustering

Approximation-based synopsis

Distributed and Parallel Processing

EB Hörsal, EDIT building, Hörsalsvägen 11
Opponent: Associate Professor Ioannis Chatzigiannakis

Author

Amir Keramatian

Networks and Systems

MAD-C: Multi-stage Approximate Distributed Cluster-combining for obstacle detection and localization

Journal of Parallel and Distributed Computing, Vol. 147 (2021), p. 248-267

Journal article

MAD-C: Multi-stage Approximate Distributed Cluster-Combining for Obstacle Detection and Localization

Lecture Notes in Computer Science, Vol. 11339 (2019), p. 312-324

Paper in proceedings

Amir Keramatian, Vincenzo Gulisano, Marina Papatriantafilou, Philippas Tsigas, “PARMA-CC: A Family of Parallel Multiphase Approximate Cluster Combining Algorithms”, the Journal of Parallel and Distributed Computing (JPDC), Under Review After Minor Revision, Elsevier, 2022.

PARMA-CC: Parallel Multiphase Approximate Cluster Combining

ICDCN 2020: Proceedings of the 21st International Conference on Distributed Computing and Networking (2020)

Paper in proceedings

Amir Keramatian, Vincenzo Gulisano, Marina Papatriantafilou, Philippas Tsigas, “IP.LSH.DBSCAN: Integrated Parallel Density-Based Clustering through Locality-Sensitive Hashing”, Under Review.

Cloud-based products and production (FiC)

Swedish Foundation for Strategic Research (SSF) (GMT14-0032), 2016-01-01 -- 2020-12-31.

Subject categories

Computer Engineering

Computer Science

Computer Systems

Areas of Advance

Information and Communication Technology

Driving forces

Sustainable development

Innovation and entrepreneurship

ISBN

978-91-7905-650-6

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5116

Publisher

Chalmers

Online

More information

Last updated

2022-05-09