Handling Data Leakage in Automotive Datasets for Object Detection
Licentiate thesis, 2025

Background: Object detection is a central component of automotive perception systems that supports safe operation of autonomous driving technologies. The performance of such models is typically evaluated using large-scale image datasets, where choices in dataset construction and splitting strategies strongly influence the ability to detect objects correctly and accurately. Specifically, image similarity between training and test sets can unintentionally create data leakage, a situation where information from the test set is indirectly accessible during training, leading to overly optimistic performance estimates and threatening the reliability of evaluation results. This can have serious consequences in safety-critical domains such as autonomous driving, where a robust and trustworthy object detection model is a prerequisite for deployment.
Objective: The overall aim of this thesis is to investigate the problem of data leakage in automotive object detection research. Specifically, it seeks to understand how different dataset splitting strategies impact the evaluation performance of models, and also to establish methods for detecting data leakage in existing train-test splits of frequently used automotive datasets to enable a more reliable assessment of object detection models.
Method: The research follows an empirical approach and is structured around four studies. Papers A and B adopt a quantitative, experimental design to investigate various data splitting strategies for automotive image datasets and their overall impact on model performance. Papers C and D follow a design science methodology, introducing the D-LeDe method for detecting data leakage in any existing data split and evaluating its effectiveness on multiple automotive datasets and object detection model architectures.
Findings: The results in Paper A demonstrate that image similarity can have a measurable impact on the reported performance of object detection models. The results in Paper B show that splitting data based on semantic similarity can significantly enhance overall performance. Paper C presents the proposed D-LeDe method for detecting data leakage in any existing data split. Paper D evaluates how effectively the D-LeDe method identifies which splits contain data leakage across different datasets and multiple models.
Conclusions: This thesis demonstrates that data leakage caused by image similarity across dataset partitions is a tangible and non-trivial problem in automotive object detection. Even modest overlap between training and test sets can inflate model performance, leading to overly optimistic conclusions about generalisation. In larger or highly redundant datasets, the effect can be even stronger. The findings confirm that data leakage is a systematic risk that can occur in many well-known datasets and can compromise the reliability of models benchmarked using the default splits of such datasets. Recognising this problem and exercising caution to avoid data leakage is therefore essential to ensure the reliability of automotive perception models.
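To make the notion of similarity-driven leakage concrete, the following is a minimal sketch of one generic way to flag test images that closely resemble a training image. It is not the D-LeDe method, whose details are not given in this abstract. The feature vectors, the image identifiers, the function names, and the 0.95 threshold are all hypothetical assumptions chosen for illustration; in practice the vectors might come from an image embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def flag_similar_pairs(train, test, threshold=0.95):
    """Return (test_id, train_id, similarity) for every test image whose
    most similar training image reaches the (assumed) threshold."""
    flagged = []
    for test_id, test_vec in test.items():
        best_id, best_sim = None, -1.0
        for train_id, train_vec in train.items():
            sim = cosine_similarity(test_vec, train_vec)
            if sim > best_sim:
                best_id, best_sim = train_id, sim
        if best_id is not None and best_sim >= threshold:
            flagged.append((test_id, best_id, best_sim))
    return flagged


# Toy, made-up feature vectors standing in for image embeddings.
train_features = {"train_001": [1.0, 0.0, 0.0], "train_002": [0.0, 1.0, 0.0]}
test_features = {"test_001": [0.99, 0.05, 0.0], "test_002": [0.0, 0.0, 1.0]}

flagged = flag_similar_pairs(train_features, test_features)
# Share of test images with a near-duplicate in the training set.
leak_ratio = len(flagged) / len(test_features)
```

In this toy setup only `test_001` is flagged, because it is nearly parallel to `train_001`. The exhaustive pairwise comparison is quadratic in the number of images, so a real dataset would likely need an approximate nearest-neighbour index instead.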

Automotive Perception

Dataset Splitting

Object Detection

Image Similarity

Data Leakage

Jupiter 520, Campus Lindholmen
Opponent: Dr. Yanja Dajsuren, Eindhoven University of Technology, the Netherlands

Author

Md Abu Ahammed Babu

University of Gothenburg

Chalmers, Computer Science and Engineering (Chalmers), Interaction Design and Software Engineering

Impact of Image Data Splitting on the Performance of Automotive Perception Systems

Lecture Notes in Business Information Processing, Vol. 505 LNBIP (2024), p. 91-111

Paper in proceeding

Exploring Image Similarity-Based Splitting Techniques in Automotive Perception Systems

Communications in Computer and Information Science, Vol. 2178 CCIS (2024), p. 51-67

Paper in proceeding

D-LeDe: A Data Leakage Detection Method for Automotive Perception Systems

International Conference on Vehicle Technology and Intelligent Transport Systems, VEHITS - Proceedings (2025), p. 210-221

Paper in proceeding

M. A. A. Babu, M. Staron, S. K. Pandey, D. Durisic, and A. Bálint. “Evaluation of the D-LeDe Method for Image Data Leakage Detection in Automotive Datasets.” Submitted to the Journal of Machine Vision and Applications 2025.

Subject Categories (SSIF 2025)

Communication Systems

Computer graphics and computer vision

Computer Sciences

Publisher

Chalmers

Online

More information

Latest update

12/3/2025