PARMA-CC: A Family of Parallel Multiphase Approximate Cluster Combining Algorithms
Journal article, 2023
We show that PARMA-CC algorithms yield equivalent clustering outcomes despite their different approaches. Furthermore, we show that certain PARMA-CC algorithms can achieve higher efficiency with respect to certain properties of the data to be clustered. Generally speaking, in PARMA-CC algorithms, parallel threads compute summaries associated with clusters of data (sub)sets. As the threads concurrently combine the summaries, they construct a comprehensive summary of the sets of clusters. By approximating a cluster with its respective geometrical summaries, PARMA-CC algorithms scale well with increased data volumes, and, by computing and efficiently combining the summaries in parallel, they enable latency improvements. PARMA-CC algorithms utilize special data structures that enable parallelism through in-place data processing. As we show in our analysis and evaluation, PARMA-CC algorithms can complement and outperform well-established methods, with significantly better scalability, while still providing highly accurate results in a variety of data sets, even with skewed data distributions, which cause the traditional approaches to exhibit their worst-case behaviour.
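The summarize-then-combine idea described in the abstract can be illustrated with a minimal sketch. It is not the PARMA-CC algorithms themselves: the bounding-box summary, the greedy local clustering, the `eps` distance threshold, and all function names below are assumptions chosen for illustration. The sketch also combines the summaries sequentially after the workers finish, whereas the abstract describes threads combining them concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def local_summaries(points, eps):
    """Greedily cluster one data subset; return one bounding box per cluster.

    Each box is [min_x, min_y, max_x, max_y] (a hypothetical geometric summary).
    """
    boxes = []
    for x, y in points:
        for b in boxes:
            # Absorb the point if it lies within eps of an existing box.
            if b[0] - eps <= x <= b[2] + eps and b[1] - eps <= y <= b[3] + eps:
                b[0], b[1] = min(b[0], x), min(b[1], y)
                b[2], b[3] = max(b[2], x), max(b[3], y)
                break
        else:
            boxes.append([x, y, x, y])
    return boxes


def boxes_close(a, b, eps):
    """True if two boxes overlap or lie within eps of each other on both axes."""
    return (a[0] <= b[2] + eps and b[0] <= a[2] + eps and
            a[1] <= b[3] + eps and b[1] <= a[3] + eps)


def combine(summaries, eps):
    """Merge summaries that are pairwise close, using a union-find structure."""
    parent = list(range(len(summaries)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(summaries)):
        for j in range(i + 1, len(summaries)):
            if boxes_close(summaries[i], summaries[j], eps):
                parent[find(i)] = find(j)

    merged = {}
    for i, b in enumerate(summaries):
        root = find(i)
        if root not in merged:
            merged[root] = list(b)
        else:
            m = merged[root]
            m[0], m[1] = min(m[0], b[0]), min(m[1], b[1])
            m[2], m[3] = max(m[2], b[2]), max(m[3], b[3])
    return sorted(tuple(m) for m in merged.values())


def approximate_clusters(points, eps, n_workers=4):
    """Summarize data subsets in parallel threads, then combine the summaries."""
    chunks = [points[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        per_chunk = list(pool.map(lambda c: local_summaries(c, eps), chunks))
    all_boxes = [b for boxes in per_chunk for b in boxes]
    return combine(all_boxes, eps)
```

For example, calling `approximate_clusters` on the eight corner points of two unit squares located at (0, 0) and (10, 10), with `eps=1.5` and `n_workers=2`, returns the two boxes `(0, 0, 1, 1)` and `(10, 10, 11, 11)`. Because each cluster is represented only by a box, the combine step works on a handful of summaries rather than on every point, which is the source of the scalability the abstract refers to.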
Parallel Clustering
Synchronization
Data Structures
Approximation
Authors
Amir Keramatian
Chalmers, Computer Science and Engineering, Computer and Network Systems
Vincenzo Massimiliano Gulisano
Chalmers, Computer Science and Engineering, Computer and Network Systems
Marina Papatriantafilou
Chalmers, Computer Science and Engineering, Computer and Network Systems
Philippas Tsigas
Chalmers, Computer Science and Engineering, Computer and Network Systems
Journal of Parallel and Distributed Computing
0743-7315 (ISSN) 1096-0848 (eISSN)
Vol. 177 68-88
HAREN: Self-deploying and adaptive data streaming analytics in the fog
Swedish Research Council (VR) (2016-03800), 2017-01-01 -- 2020-12-31.
Cloud-based products and production (FiC)
Swedish Foundation for Strategic Research (SSF) (GMT14-0032), 2016-01-01 -- 2020-12-31.
Subject Categories (SSIF 2011)
Computer Engineering
Media Engineering
Computer Science
Computer Systems
Areas of Advance
Information and Communication Technology
Production
Driving Forces
Sustainable development
DOI
10.1016/j.jpdc.2023.02.001