PARMA-CC: A Family of Parallel Multiphase Approximate Cluster Combining Algorithms

Amir Keramatian; Vincenzo Massimiliano Gulisano; Marina Papatriantafilou; Philippas Tsigas

doi:10.1016/j.jpdc.2023.02.001

Journal article, 2023

Clustering is a common task in data analysis applications. Despite the extensive literature, the continuously increasing volumes of data produced by sensors (e.g., rates of several MB/s from 3D scanners such as LIDAR sensors) and the time-sensitivity of the applications that rely on the clustering outcomes (e.g., detecting critical situations, such as a robot arm crossing a boundary and potentially injuring human beings) demand efficient data clustering algorithms that can effectively utilize the increasing computational capacities of modern hardware. To that end, we leverage approximation and parallelization: the former scales down the amount of data, and the latter scales up the computation. Regarding parallelization, we explore a design space for synchronization and workload distribution among the threads. As we study different parts of the design space, we propose representative Parallel Multiphase Approximate Cluster Combining (PARMA-CC) algorithms.

We show that PARMA-CC algorithms yield equivalent clustering outcomes despite their different approaches. Furthermore, we show that certain PARMA-CC algorithms can achieve higher efficiency depending on certain properties of the data to be clustered. Generally speaking, in PARMA-CC algorithms, parallel threads compute summaries associated with clusters of data (sub)sets. As the threads concurrently combine the summaries, they construct a comprehensive summary of the sets of clusters. By approximating each cluster with its geometrical summaries, PARMA-CC algorithms scale well with increased data volumes, and, by computing and efficiently combining the summaries in parallel, they enable latency improvements. PARMA-CC algorithms utilize special data structures that enable parallelism through in-place data processing. As we show in our analysis and evaluation, PARMA-CC algorithms can complement and outperform well-established methods, with significantly better scalability, while still providing highly accurate results on a variety of data sets, even with skewed data distributions, which cause the traditional approaches to exhibit their worst-case behaviour.
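The summarize-then-combine idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the summary type, the local clustering method, or the data structures, so the sketch assumes 2D points, axis-aligned bounding boxes as geometrical summaries, a naive distance-threshold local clustering, and a union-find structure for combining. All function names and the `eps` parameter are hypothetical. For simplicity, the combining phase here runs sequentially, whereas the abstract describes threads combining summaries concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


class DisjointSet:
    """Minimal union-find, used here as an assumed combining structure."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri != rj:
            self.parent[rj] = ri


def summarize_subset(points, eps):
    """Phase 1 (assumed): cluster one subset locally and return
    one bounding box (min_x, min_y, max_x, max_y) per local cluster."""
    ds = DisjointSet(len(points))
    for i, j in combinations(range(len(points)), 2):
        (x1, y1), (x2, y2) = points[i], points[j]
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 <= eps ** 2:
            ds.union(i, j)
    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(ds.find(i), []).append(p)
    return [
        (min(x for x, _ in g), min(y for _, y in g),
         max(x for x, _ in g), max(y for _, y in g))
        for g in groups.values()
    ]


def boxes_close(a, b, eps):
    """True if the gap between two boxes is at most eps along both axes."""
    return (a[0] - eps <= b[2] and b[0] - eps <= a[2]
            and a[1] - eps <= b[3] and b[1] - eps <= a[3])


def combine(summaries, eps):
    """Phase 2 (assumed, sequential here): merge summaries that are close.
    Returns one cluster label per summary."""
    ds = DisjointSet(len(summaries))
    for i, j in combinations(range(len(summaries)), 2):
        if boxes_close(summaries[i], summaries[j], eps):
            ds.union(i, j)
    return [ds.find(i) for i in range(len(summaries))]


def approximate_clusters(subsets, eps, workers=4):
    """Summarize each subset in its own task, then combine all summaries."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_subset = list(pool.map(lambda s: summarize_subset(s, eps), subsets))
    summaries = [box for boxes in per_subset for box in boxes]
    return summaries, combine(summaries, eps)
```

Calling `approximate_clusters([[(0, 0), (0.5, 0)], [(1.0, 0), (10, 10)]], eps=0.6)` yields three summaries, of which the first two share a label because their boxes lie within `eps` of each other. Because only the boxes are compared in the second phase, its cost depends on the number of local clusters rather than the number of points, which is the sense in which approximation scales down the data. Python threads are used only to show the structure; a CPU-bound implementation would need native threads or processes to see real speedup.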

Synchronization

Data Structures

Parallel Clustering

Approximation

Authors

Amir Keramatian

Chalmers, Computer Science and Engineering, Computer and Network Systems

Vincenzo Massimiliano Gulisano

Chalmers, Computer Science and Engineering, Computer and Network Systems

Marina Papatriantafilou

Chalmers, Computer Science and Engineering, Computer and Network Systems

Philippas Tsigas

Chalmers, Computer Science and Engineering, Computer and Network Systems

Journal of Parallel and Distributed Computing

0743-7315 (ISSN) 1096-0848 (eISSN)

Vol. 177, pp. 68-88

HAREN: Self-deploying and adaptive data streaming analytics in the fog

Swedish Research Council (VR) (2016-03800), 2017-01-01 to 2020-12-31.

Cloud-based products and production (FiC)

Swedish Foundation for Strategic Research (SSF) (GMT14-0032), 2016-01-01 to 2020-12-31.

Subject categories (SSIF 2011)

Computer Engineering

Media Engineering

Computer Sciences

Computer Systems

Areas of Advance

Information and Communication Technology

Production

Driving Forces

Sustainable development

DOI

10.1016/j.jpdc.2023.02.001


More information

Last updated

2026-03-24
