IP.LSH.DBSCAN: Integrated parallel density-based clustering by locality-sensitive hashing
Journal article, 2025

Locality-sensitive hashing (LSH) is an established method for fast data indexing and approximate similarity search, with useful parallelism properties. Although indexes and similarity measures are central to data clustering, the multifaceted benefits of LSH for clustering have received little study. We show how approximate DBSCAN clustering can be fused into the process of creating an LSH index and how, through parallelization and fine-grained synchronization, it can also efficiently utilize the available computing capacity. The resulting algorithm, IP.LSH.DBSCAN, described in this article, can support a wide range of applications with diverse distance functions, data distributions, and dimensionality. We analyse the algorithm’s asymptotic completion time and provide an open-source prototype implementation. We also conduct a detailed evaluation measuring latency and accuracy metrics of IP.LSH.DBSCAN on a 36-core machine with 2-way hyper-threading, using massive datasets with various numbers of dimensions. The analysis and the empirical study of IP.LSH.DBSCAN show how it complements the landscape of established state-of-the-art methods, offering speed-ups of up to several orders of magnitude on higher-dimensional datasets, with tunable high clustering accuracy.
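
As a rough illustration of the general idea in the abstract (finding density-connected points through LSH buckets rather than exhaustive neighbour search), the following is a minimal sequential sketch in Python. It is not the authors' IP.LSH.DBSCAN algorithm and contains none of its parallelization or synchronization. The function name `lsh_dbscan_sketch`, its parameters, the choice of Gaussian random-projection hashing for Euclidean distance, and the bucket width of `2 * eps` are all assumptions made here for illustration only.

```python
import math
import random
from collections import defaultdict


def make_hash(dim, bucket_width, rng):
    """One random-projection LSH function for Euclidean distance (assumed scheme)."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    offset = rng.uniform(0.0, bucket_width)

    def h(point):
        projection = sum(d * x for d, x in zip(direction, point))
        return math.floor((projection + offset) / bucket_width)

    return h


class UnionFind:
    """Disjoint sets used to merge points into clusters."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri != rj:
            self.parent[rj] = ri


def lsh_dbscan_sketch(points, eps, min_pts, num_tables=8, hashes_per_table=2, seed=0):
    """Approximate DBSCAN labels; -1 marks noise. Hypothetical illustration only."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Build the LSH index: each table maps a tuple of hash values to point indices.
    tables = []
    for _ in range(num_tables):
        hashes = [make_hash(dim, 2 * eps, rng) for _ in range(hashes_per_table)]
        buckets = defaultdict(list)
        for idx, p in enumerate(points):
            buckets[tuple(h(p) for h in hashes)].append(idx)
        tables.append(buckets)

    # Pass 1: a point is treated as core if some bucket holds at least min_pts
    # points (itself included) within eps of it. Neighbours that never share a
    # bucket are missed, which is where the approximation comes from.
    is_core = [False] * n
    for buckets in tables:
        for members in buckets.values():
            if len(members) < min_pts:
                continue
            for i in members:
                if not is_core[i]:
                    near = sum(1 for j in members
                               if math.dist(points[i], points[j]) <= eps)
                    is_core[i] = near >= min_pts

    # Pass 2: merge core points that are within eps and share a bucket;
    # attach each border point to the first core point that reaches it.
    uf = UnionFind(n)
    attached = [False] * n
    for buckets in tables:
        for members in buckets.values():
            for i in (m for m in members if is_core[m]):
                for j in members:
                    if j == i or math.dist(points[i], points[j]) > eps:
                        continue
                    if is_core[j]:
                        uf.union(i, j)
                    elif not attached[j]:
                        uf.union(i, j)
                        attached[j] = True

    # Turn union-find roots into consecutive cluster ids.
    ids, labels = {}, []
    for i in range(n):
        if is_core[i] or attached[i]:
            labels.append(ids.setdefault(uf.find(i), len(ids)))
        else:
            labels.append(-1)
    return labels
```

Calling `lsh_dbscan_sketch(points, eps=0.5, min_pts=5)` on a list of equal-length coordinate tuples returns one integer label per point. Using more tables lowers the chance of missing a true neighbour at the cost of more work, which loosely mirrors the tunable accuracy mentioned in the abstract.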

Density-based clustering

High-dimension data analytics

Approximation algorithms

Similarity-based clustering

Data summarization

Author

Amir Keramatian

Chalmers, Computer Science and Engineering (Chalmers), Computer and Network Systems

Vincenzo Massimiliano Gulisano

Chalmers, Computer Science and Engineering (Chalmers), Computer and Network Systems

Marina Papatriantafilou

Chalmers, Computer Science and Engineering (Chalmers), Computer and Network Systems

Philippas Tsigas

Chalmers, Computer Science and Engineering (Chalmers), Computer and Network Systems

Discrete Applied Mathematics

0166-218X (ISSN)

Scalability and quality control in AM - Big Data and ML in Production

Chalmers, 2020-01-01 -- .

WASP-WISE STRATIFIER

Wallenberg AI, Autonomous Systems and Software Program, 2024-01-01 -- 2025-01-01.

Wallenberg Initiative Materials Science for Sustainability, 2024-01-01 -- 2025-01-01.

VR EPITOME - Summarization and structuring of continuous data in concurrent processing pipelines

Swedish Research Council (VR) (2021-05424), 2022-01-01 -- 2025-12-31.

Relaxed Semantics Across the Data Analytics Stack (RELAX-DN)

European Commission (EC) (EC/HE/101072456), 2023-03-01 -- 2027-03-01.

INDEED: Information and Data-processing in Focus for Energy Efficiency

Chalmers, 2020-01-01 -- .

Areas of Advance

Information and Communication Technology

Transport

Production

Energy

Subject Categories (SSIF 2025)

Bioinformatics (Computational Biology)

Computer Sciences

DOI

10.1016/j.dam.2025.11.047

More information

Latest update

12/17/2025