PARMA-CC: A Family of Parallel Multiphase Approximate Cluster Combining Algorithms
Preprint, 2022
We show that PARMA-CC algorithms yield equivalent clustering outcomes despite their different approaches. Furthermore, we show that certain PARMA-CC algorithms can achieve higher efficiency with respect to certain properties of the data to be clustered. Generally speaking, in PARMA-CC algorithms, parallel threads compute summaries associated with clusters of data (sub)sets. As the threads concurrently combine the summaries, they construct a comprehensive summary of the sets of clusters. By approximating a cluster with its respective geometrical summaries, PARMA-CC algorithms scale well with increased data volumes, and, by computing and efficiently combining the summaries in parallel, they enable latency improvements. PARMA-CC algorithms utilize special data structures that enable parallelism through in-place data processing. As we show in our analysis and evaluation, PARMA-CC algorithms can complement and outperform well-established methods, with significantly better scalability, while still providing highly accurate results in a variety of data sets, even with skewed data distributions, which cause the traditional approaches to exhibit their worst-case behaviour.
Synchronization
Data Structures
Parallel Clustering
Approximation
Författare
Amir Keramatian
Nätverk och System
Vincenzo Massimiliano Gulisano
Nätverk och System
Marina Papatriantafilou
Nätverk och System
Philippas Tsigas
Nätverk och System
Molnbaserade produkter och produktion (FiC)
Stiftelsen för Strategisk forskning (SSF) (GMT14-0032), 2016-01-01 -- 2020-12-31.
Ämneskategorier
Datorteknik
Datavetenskap (datalogi)
Datorsystem
Styrkeområden
Informations- och kommunikationsteknik
Drivkrafter
Hållbar utveckling