IP.LSH.DBSCAN: Integrated parallel density-based clustering by locality-sensitive hashing
Journal article, 2026

Locality-sensitive hashing (LSH) is an established method for fast data indexing and approximate similarity search, with useful parallelism properties. Although indexes and similarity measures are key to data clustering, the multifaceted benefits of LSH for clustering have received little investigation. We show how approximate DBSCAN clustering can be fused into the process of creating an LSH index and, through parallelization and fine-grained synchronization, can also efficiently utilize the available computing capacity. The resulting algorithm, IP.LSH.DBSCAN, described in this article, can support a wide range of applications with diverse distance functions, data distributions, and dimensionality. We analyse the algorithm’s asymptotic completion time and provide an open-source prototype implementation. We also conduct a detailed evaluation measuring latency and accuracy metrics of IP.LSH.DBSCAN on a 36-core machine with 2-way hyper-threading, using massive datasets with various numbers of dimensions. The analysis and the empirical study of IP.LSH.DBSCAN show how it complements the landscape of established state-of-the-art methods by offering up to several orders of magnitude speed-up on higher-dimensional datasets, with tunable high clustering accuracy.
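
To make the idea in the abstract concrete, the following is a minimal, sequential sketch of clustering carried out while LSH tables are filled. It is not the authors' IP.LSH.DBSCAN implementation: the function name `lsh_dbscan_sketch`, its parameters, the choice of Gaussian (p-stable) projections for Euclidean distance, and the simplification that every bucket holding at least `min_pts` points is merged into one cluster are all assumptions made for illustration. The parallelization and fine-grained synchronization described in the abstract are omitted.

```python
import math
import random
from collections import defaultdict


def lsh_dbscan_sketch(points, eps, min_pts, num_tables=8, hashes_per_table=3, seed=0):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise.

    Hypothetical illustration only; all names and defaults are assumptions.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    in_dense_bucket = [False] * n
    for _ in range(num_tables):
        # One table uses a concatenation of several random-projection hashes
        # with bucket width eps (an assumed, simple choice of width).
        funcs = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, eps))
                 for _ in range(hashes_per_table)]
        buckets = defaultdict(list)
        for idx, p in enumerate(points):
            key = tuple(math.floor((sum(a * x for a, x in zip(vec, p)) + off) / eps)
                        for vec, off in funcs)
            buckets[key].append(idx)
        # Clustering is fused with indexing: as soon as a table is filled,
        # each sufficiently dense bucket is merged into a single cluster.
        for members in buckets.values():
            if len(members) >= min_pts:
                root = find(members[0])
                for m in members:
                    in_dense_bucket[m] = True
                    parent[find(m)] = root

    # Points that never landed in a dense bucket are reported as noise.
    labels, ids = [], {}
    for i in range(n):
        if in_dense_bucket[i]:
            labels.append(ids.setdefault(find(i), len(ids)))
        else:
            labels.append(-1)
    return labels
```

Called on two tight, well-separated groups of points plus one isolated point, with `eps` larger than the spread within each group, the sketch would be expected to give each group its own label and mark the isolated point as noise (-1). Because the hashes are random, this holds with high probability rather than with certainty. Raising `num_tables` makes it more likely that nearby points share a bucket at least once, while raising `hashes_per_table` makes accidental collisions between distant points less likely.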

Density-based clustering

High-dimension data analytics

Data summarization

Similarity-based clustering

Approximation algorithms

Authors

Amir Keramatian

University of Gothenburg

Chalmers, Computer Science and Engineering, Computer and Network Systems

Vincenzo Massimiliano Gulisano

University of Gothenburg

Chalmers, Computer Science and Engineering, Computer and Network Systems

Marina Papatriantafilou

University of Gothenburg

Chalmers, Computer Science and Engineering, Computer and Network Systems

Philippas Tsigas

University of Gothenburg

Chalmers, Computer Science and Engineering, Computer and Network Systems

Discrete Applied Mathematics

0166-218X (ISSN)

Vol. 382, p. 183-196

INDEED: Information and Data-processing in Focus for Energy Efficiency

Chalmers, 2020-01-01 -- .

Adapted data structure semantics for scalable data processing

Swedish Research Council (VR) (2021-05443), 2022-01-01 -- 2025-12-31.

VR EPITOME - Summarization and structuring of continuous data in concurrent processing pipelines

Swedish Research Council (VR) (2021-05424), 2022-01-01 -- 2025-12-31.

Relaxed Semantics Across the Data Analytics Stack (RELAX-DN)

European Commission (EU) (EC/HE/101072456), 2023-03-01 -- 2027-03-01.

Scalability and quality control in AM -- Big Data and ML in manufacturing processes

Chalmers, 2020-01-01 -- .

Areas of Advance

Information and Communication Technology

Transport

Production

Energy

Subject Categories (SSIF 2025)

Bioinformatics (Computational Biology)

Computer Sciences

DOI

10.1016/j.dam.2025.11.047

Related datasets

Artifact and instructions to generate experimental results for the Euro-Par 2022 paper: "IP.LSH.DBSCAN: Integrated Parallel Density-Based Clustering through Locality-Sensitive Hashing" [dataset]

DOI: https://doi.org/10.6084/m9.figshare.19991786

More information

Last updated

2025-12-17