Handling Data Leakage in Automotive Datasets for Object Detection
Licentiate thesis, 2025
Method: The research follows an empirical approach and is structured around four studies. Papers A and B adopt a quantitative, experimental design to investigate various data splitting strategies for automotive image datasets and their overall impact on model performance. Papers C and D follow a design science methodology, introducing the D-LeDe method for detecting data leakage in any existing data split and evaluating its effectiveness across multiple automotive datasets and object detection model architectures.
Findings: The results in Paper A demonstrate that image similarity can have a measurable impact on the reported performance of object detection models. The results in Paper B show that splitting data based on semantic similarity can significantly enhance overall performance. Paper C presents the D-LeDe method for detecting data leakage in any existing data split. Paper D shows how effectively the D-LeDe method identifies which splits contain data leakage across different datasets and multiple models.
Conclusions: This thesis demonstrates that data leakage caused by image similarity across dataset partitions is a tangible and non-trivial problem in automotive object detection. Even modest overlap between training and test sets can inflate model performance, leading to overly optimistic conclusions about generalisation. In larger or highly redundant datasets, the effect can be even stronger. The findings confirm that data leakage is a systematic risk that can occur in many well-known datasets and can compromise the reliability of models benchmarked using the default splits of such datasets. Recognising this problem and taking care to avoid data leakage is therefore essential to ensure the reliability of automotive perception models.
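As a rough illustration of the cross-partition overlap described above, the following sketch flags test images that are near-duplicates of training images by comparing a simple average hash. It is not the D-LeDe method, whose details are not given in this summary. The function names, the toy 4x4 frames, and the distance threshold are all assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Assumption: images are small grayscale grids (rows of 0-255 values) that
# all share the same size. A real pipeline would first resize each image.
Image = List[List[int]]


def average_hash(image: Image) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the image mean."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def find_leaks(
    train: Dict[str, Image],
    test: Dict[str, Image],
    max_distance: int = 1,  # hypothetical threshold, chosen for the toy data
) -> List[Tuple[str, str, int]]:
    """Return (test_name, train_name, distance) for every near-duplicate pair."""
    train_hashes = {name: average_hash(img) for name, img in train.items()}
    leaks = []
    for test_name, img in test.items():
        test_hash = average_hash(img)
        for train_name, train_hash in train_hashes.items():
            distance = hamming(test_hash, train_hash)
            if distance <= max_distance:
                leaks.append((test_name, train_name, distance))
    return leaks


# Toy data: frame_a_next mimics a consecutive video frame of frame_a,
# while frame_b shows a different scene layout.
frame_a = [[10, 10, 200, 200] for _ in range(4)]
frame_a_next = [
    [12, 11, 198, 205],
    [9, 14, 201, 199],
    [10, 10, 203, 197],
    [11, 8, 200, 202],
]
frame_b = [[200] * 4, [200] * 4, [10] * 4, [10] * 4]

train = {"train_001": frame_a}
test = {"test_001": frame_a_next, "test_002": frame_b}

leaks = find_leaks(train, test)
print(leaks)
```

In this toy setting only `test_001` is reported, because it differs from `train_001` by small pixel noise, which is the situation that arises when consecutive frames of one driving sequence end up in different partitions. The pairwise comparison is quadratic in the number of images, so a larger dataset would need an index or a grouping step instead.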
Automotive Perception
Dataset Splitting
Object Detection
Image Similarity
Data Leakage
Author
Md Abu Ahammed Babu
University of Gothenburg
Chalmers, Computer Science and Engineering (Chalmers), Interaction Design and Software Engineering
Impact of Image Data Splitting on the Performance of Automotive Perception Systems
Lecture Notes in Business Information Processing, Vol. 505 LNBIP (2024), p. 91-111
Paper in proceeding
Exploring Image Similarity-Based Splitting Techniques in Automotive Perception Systems
Communications in Computer and Information Science, Vol. 2178 CCIS (2024), p. 51-67
Paper in proceeding
D-LeDe: A Data Leakage Detection Method for Automotive Perception Systems
International Conference on Vehicle Technology and Intelligent Transport Systems, VEHITS - Proceedings (2025), p. 210-221
Paper in proceeding
M. A. A. Babu, M. Staron, S. K. Pandey, D. Durisic, and A. Bálint. “Evaluation of the D-LeDe Method for Image Data Leakage Detection in Automotive Datasets.” Submitted to the Journal of Machine Vision and Applications, 2025.
Subject Categories (SSIF 2025)
Communication Systems
Computer graphics and computer vision
Computer Sciences
Publisher
Chalmers
Jupiter 520, Campus Lindholmen
Opponent: Dr. Yanja Dajsuren, Eindhoven University of Technology, the Netherlands