IP.LSH.DBSCAN: Integrated parallel density-based clustering by locality-sensitive hashing
Journal article, 2026

Locality-sensitive hashing (LSH) is an established method for fast data indexing and approximate similarity search, with useful parallelism properties. Although indexes and similarity measures are key to data clustering, the multifaceted benefits of LSH for the clustering problem have received little investigation. We show how approximate DBSCAN clustering can be fused into the process of creating an LSH index and how, through parallelization and fine-grained synchronization, it can also make efficient use of the available computing capacity. The resulting algorithm, IP.LSH.DBSCAN, described in this article, supports a wide range of applications with diverse distance functions, data distributions and dimensionality. We analyse the algorithm’s asymptotic completion time and provide an open-source prototype implementation. We also conduct a detailed evaluation of the latency and accuracy of IP.LSH.DBSCAN on a 36-core machine with 2-way hyper-threading, using massive datasets with various numbers of dimensions. The analysis and the empirical study show how IP.LSH.DBSCAN complements the landscape of established state-of-the-art methods by offering speed-ups of up to several orders of magnitude on higher-dimensional datasets, with tunable high clustering accuracy.
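To make the idea of clustering through an LSH index concrete, the following is a minimal sequential sketch in Python. It is not the authors' IP.LSH.DBSCAN implementation and it omits the parallelization and fine-grained synchronization the abstract describes. It assumes Euclidean distance, a standard random-projection hash of the form floor((a·x + b) / w) with bucket width w set to eps, and exact distance checks among points that share a bucket. All names (`lsh_dbscan_sketch`, `build_tables`, `num_tables`, `hashes_per_table`) are hypothetical and chosen only for this illustration.

```python
import math
import random
from collections import defaultdict


class UnionFind:
    """Disjoint-set forest used to merge core points into clusters."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def build_tables(points, eps, num_tables, hashes_per_table, rng):
    """Hash every point into num_tables tables of random-projection buckets."""
    dim = len(points[0])
    tables = []
    for _ in range(num_tables):
        # Each hash function is a Gaussian direction plus a random offset.
        funcs = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, eps))
                 for _ in range(hashes_per_table)]
        buckets = defaultdict(list)
        for i, p in enumerate(points):
            key = tuple(
                math.floor((sum(a * x for a, x in zip(direction, p)) + offset) / eps)
                for direction, offset in funcs
            )
            buckets[key].append(i)
        tables.append(buckets)
    return tables


def lsh_dbscan_sketch(points, eps, min_pts, num_tables=8, hashes_per_table=2, seed=0):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    tables = build_tables(points, eps, num_tables, hashes_per_table, random.Random(seed))

    # Candidate neighbours are points sharing a bucket in some table; an exact
    # distance check removes false positives. Each point counts itself.
    neighbours = [set() for _ in range(n)]
    for buckets in tables:
        for members in buckets.values():
            for i in members:
                for j in members:
                    if math.dist(points[i], points[j]) <= eps:
                        neighbours[i].add(j)

    # Approximate core test: neighbours missed by every table are not counted.
    is_core = [len(neighbours[i]) >= min_pts for i in range(n)]

    # Core points that are neighbours belong to the same cluster.
    uf = UnionFind(n)
    for i in range(n):
        if is_core[i]:
            for j in neighbours[i]:
                if is_core[j]:
                    uf.union(i, j)

    labels = [-1] * n
    cluster_ids = {}
    for i in range(n):
        if is_core[i]:
            labels[i] = cluster_ids.setdefault(uf.find(i), len(cluster_ids))
    # Border points take the label of any neighbouring core point.
    for i in range(n):
        if not is_core[i]:
            for j in neighbours[i]:
                if is_core[j]:
                    labels[i] = labels[j]
                    break
    return labels
```

Calling `lsh_dbscan_sketch(points, eps=1.0, min_pts=4)` on a list of coordinate tuples returns one integer label per point. In this sketch, more tables raise the chance that true neighbours share a bucket, at the cost of more hashing work; this only illustrates the kind of accuracy and speed trade-off the abstract calls tunable.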

Data summarization

Approximation algorithms

High-dimension data analytics

Similarity-based clustering

Density-based clustering

Author

Amir Keramatian

University of Gothenburg

Chalmers, Computer Science and Engineering (Chalmers), Computer and Network Systems

Vincenzo Massimiliano Gulisano

University of Gothenburg

Chalmers, Computer Science and Engineering (Chalmers), Computer and Network Systems

Marina Papatriantafilou

Chalmers, Computer Science and Engineering (Chalmers), Computer and Network Systems

University of Gothenburg

Philippas Tsigas

University of Gothenburg

Chalmers, Computer Science and Engineering (Chalmers), Computer and Network Systems

Discrete Applied Mathematics

0166-218X (ISSN)

Vol. 382, pp. 183-196

VR EPITOME - Summarization and structuring of continuous data in concurrent processing pipelines

Swedish Research Council (VR) (2021-05424), 2022-01-01 -- 2025-12-31.

Relaxed Semantics Across the Data Analytics Stack (RELAX-DN)

European Commission (EC) (EC/HE/101072456), 2023-03-01 -- 2027-03-01.

Scalability and quality control in AM - Big Data and ML in Production

Chalmers, 2020-01-01 -- .

INDEED: Information and Data-processing in Focus for Energy Efficiency

Chalmers, 2020-01-01 -- .

Relaxed Concurrent Data Structure Semantics for Scalable Data Processing

Swedish Research Council (VR) (2021-05443), 2022-01-01 -- 2025-12-31.

Areas of Advance

Information and Communication Technology

Transport

Production

Energy

Subject Categories (SSIF 2025)

Bioinformatics (Computational Biology)

Computer Sciences

DOI

10.1016/j.dam.2025.11.047

Related datasets

Artifact and instructions to generate experimental results for the Euro-Par 2022 paper: "IP.LSH.DBSCAN: Integrated Parallel Density-Based Clustering through Locality-Sensitive Hashing" [dataset]

DOI: https://doi.org/10.6084/m9.figshare.19991786

More information

Latest update

2025-12-22