IP.LSH.DBSCAN: Integrated parallel density-based clustering by locality-sensitive hashing
Journal article, 2026

Locality-sensitive hashing (LSH) is an established method for fast data indexing and approximate similarity search, with useful parallelism properties. Although indexes and similarity measures are key to data clustering, the multifaceted benefits of LSH for this problem have received little attention. We show how approximate DBSCAN clustering can be fused into the process of creating an LSH index and, through parallelization and fine-grained synchronization, efficiently utilize the available computing capacity. The resulting algorithm, IP.LSH.DBSCAN, described in this article, can support a wide range of applications with diverse distance functions, data distributions, and dimensionality. We analyse the algorithm’s asymptotic completion time and provide an open-source prototype implementation. We also conduct a detailed evaluation measuring the latency and accuracy of IP.LSH.DBSCAN on a 36-core machine with 2-way hyper-threading, using massive datasets with various numbers of dimensions. The analysis and the empirical study show how IP.LSH.DBSCAN complements the landscape of established state-of-the-art methods, offering speed-ups of up to several orders of magnitude on higher-dimensional datasets, with tunable high clustering accuracy.
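To make the general idea concrete, the following is a minimal sequential sketch of approximate DBSCAN that finds neighbour candidates through LSH buckets. It is an illustration under stated assumptions, not the IP.LSH.DBSCAN algorithm or the authors' prototype: it is single-threaded, it builds the index first and clusters afterwards instead of fusing the two steps, and it supports only Euclidean distance through Gaussian random projections. The function name `lsh_dbscan_sketch`, the parameters `num_tables` and `hashes_per_table`, and the bucket width of `4 * eps` are all hypothetical choices made for this example.

```python
import math
import random
from collections import defaultdict


def lsh_dbscan_sketch(points, eps, min_pts, num_tables=8, hashes_per_table=2, seed=0):
    """Approximate DBSCAN using LSH buckets as neighbour candidates.

    Illustrative only: sequential, Euclidean distance, not the published algorithm.
    Returns one label per point; -1 marks noise.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    width = 4.0 * eps  # assumed bucket width, not a value from the article

    # Step 1: build several hash tables; each key concatenates a few
    # quantized Gaussian random projections (LSH for Euclidean distance).
    all_buckets = []
    for _ in range(num_tables):
        projections = [
            ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
            for _ in range(hashes_per_table)
        ]
        table = defaultdict(list)
        for idx, p in enumerate(points):
            key = tuple(
                math.floor((sum(a * x for a, x in zip(vec, p)) + offset) / width)
                for vec, offset in projections
            )
            table[key].append(idx)
        all_buckets.extend(table.values())

    # Step 2: only points sharing a bucket are compared, which is where the
    # approximation comes from (true neighbours may never collide).
    neighbours = [set() for _ in range(n)]
    for members in all_buckets:
        for i in members:
            for j in members:
                if math.dist(points[i], points[j]) <= eps:
                    neighbours[i].add(j)  # includes the point itself
    core = [len(found) >= min_pts for found in neighbours]

    # Step 3: merge core points that are within eps of each other (union-find).
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        if core[i]:
            for j in neighbours[i]:
                if core[j]:
                    ri, rj = find(i), find(j)
                    if ri != rj:
                        parent[max(ri, rj)] = min(ri, rj)

    # Step 4: core points take their component root as label; a border point
    # joins the cluster of one neighbouring core point; the rest is noise.
    labels = [-1] * n
    for i in range(n):
        if core[i]:
            labels[i] = find(i)
        else:
            core_neighbours = [j for j in neighbours[i] if core[j]]
            if core_neighbours:
                labels[i] = find(min(core_neighbours))
    return labels
```

Calling `lsh_dbscan_sketch(points, eps=1.0, min_pts=3)` on a list of equal-length coordinate tuples returns a list of cluster labels. Raising `num_tables` makes it more likely that true neighbours share at least one bucket, at the cost of more hashing work, which is one simple way to picture the tunable trade-off between speed and accuracy mentioned in the abstract.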

Density-based clustering

High-dimension data analytics

Data summarization

Similarity-based clustering

Approximation algorithms

Authors

Amir Keramatian

University of Gothenburg

Chalmers, Computer Science and Engineering (Chalmers), Computer and Network Systems

Vincenzo Massimiliano Gulisano

University of Gothenburg

Chalmers, Computer Science and Engineering (Chalmers), Computer and Network Systems

Marina Papatriantafilou

University of Gothenburg

Chalmers, Computer Science and Engineering (Chalmers), Computer and Network Systems

Philippas Tsigas

University of Gothenburg

Chalmers, Computer Science and Engineering (Chalmers), Computer and Network Systems

Discrete Applied Mathematics

0166-218X (ISSN)

Vol. 382, pp. 183-196

INDEED: Information and Data-processing in Focus for Energy Efficiency

Chalmers, 2020-01-01 -- .

Relaxed Concurrent Data Structure Semantics for Scalable Data Processing

Swedish Research Council (VR) (2021-05443), 2022-01-01 -- 2025-12-31.

VR EPITOME - Summarization and structuring of continuous data in concurrent processing pipelines

Swedish Research Council (VR) (2021-05424), 2022-01-01 -- 2025-12-31.

Relaxed Semantics Across the Data Analytics Stack (RELAX-DN)

European Commission (EC) (EC/HE/101072456), 2023-03-01 -- 2027-03-01.

Scalability and quality control in AM - Big Data and ML in Production

Chalmers, 2020-01-01 -- .

Areas of Advance

Information and Communication Technology

Transport

Production

Energy

Subject Categories (SSIF 2025)

Bioinformatics (Computational Biology)

Computer Sciences

DOI

10.1016/j.dam.2025.11.047

Related datasets

Artifact and instructions to generate experimental results for the Euro-Par 2022 paper: "IP.LSH.DBSCAN: Integrated Parallel Density-Based Clustering through Locality-Sensitive Hashing" [dataset]

DOI: https://doi.org/10.6084/m9.figshare.19991786

More information

Latest update

12/17/2025