Learning-Based Sensor Fusion for 3D Scene Understanding in ADAS and AD
Licentiate thesis, 2026
Robust perception is central to advanced driver assistance systems (ADAS) and autonomous driving (AD), which must operate reliably in complex, continuously changing environments. To achieve this, information from multiple sensors such as cameras, lidar, and radar must be integrated to exploit their complementary strengths while remaining robust to sensor degradation and partial observability. This thesis studies learning-based multimodal perception, focusing on architectural design choices for sensor fusion, spatial scene representations, and geometry-centric representation learning.
In multimodal perception, a key design question is at what stage information
from different sensors should be fused. We investigate how mid-level fusion
enables effective learning of both modality-specific feature extraction and
cross-modal interaction. In structured spatial representations such as Bird's-Eye-View (BEV) grids, we show that attention-based fusion allows models to weight sensor contributions dynamically depending on context, improving robustness and scene understanding compared with schemes that collapse features early.
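The idea of attention-based BEV fusion can be illustrated with a minimal sketch (not the thesis implementation; the scoring vectors stand in for learned parameters): each BEV cell receives a softmax weight per modality, so the fused feature is a context-dependent mixture rather than a fixed early sum.

```python
# Illustrative sketch of attention-weighted BEV fusion. w_cam and w_lidar
# are hypothetical learned scoring vectors; a real model would compute
# them with a small network.
import numpy as np

def attention_bev_fusion(cam_bev, lidar_bev, w_cam, w_lidar):
    """Fuse two BEV feature maps of shape (H, W, C).

    Returns the fused map and the per-cell modality weights.
    """
    # Per-cell scalar score for each modality.
    s_cam = cam_bev @ w_cam        # (H, W)
    s_lidar = lidar_bev @ w_lidar  # (H, W)
    scores = np.stack([s_cam, s_lidar], axis=-1)  # (H, W, 2)
    # Softmax over the modality axis -> context-dependent weights.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted combination instead of an unconditional early sum.
    fused = weights[..., 0:1] * cam_bev + weights[..., 1:2] * lidar_bev
    return fused, weights
```

Because the weights are recomputed per cell, a degraded modality (e.g. camera at night) can be down-weighted locally without retraining.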
While BEV representations provide a convenient fusion space, they impose
fixed spatial discretization and scale poorly to full three-dimensional reasoning.
To address this, we explore probabilistic scene representations based on learnable
3D Gaussian particles, showing that sparse distance measurements from lidar
and radar serve as inductive priors for stable multimodal learning in less
structured scene representations.
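How sparse range measurements can act as priors for a Gaussian-particle scene representation can be sketched as follows (an assumption-laden illustration: parameter names, initial scales, and opacities are invented for the example, not taken from the thesis): Gaussian means are initialised at measured lidar positions, and the density they induce is then refined by learning.

```python
# Hypothetical sketch: initialise learnable 3D Gaussian particles from
# sparse lidar returns so that measured geometry anchors where
# probability mass is placed.
import numpy as np

def init_gaussians_from_points(points, init_scale=0.5):
    """points: (N, 3) lidar hits -> dict of Gaussian parameters.

    Means start at the measured positions; covariances start isotropic
    and would later be refined jointly with features by gradient descent.
    """
    n = points.shape[0]
    return {
        "means": points.copy(),                        # (N, 3)
        "log_scales": np.full((n, 3), np.log(init_scale)),
        "opacities": np.full((n,), 0.1),               # low initial confidence
    }

def gaussian_density(g, query):
    """Summed isotropic Gaussian density at query points of shape (M, 3)."""
    scales = np.exp(g["log_scales"])                   # (N, 3)
    diff = query[:, None, :] - g["means"][None, :, :]  # (M, N, 3)
    mahal = ((diff / scales[None]) ** 2).sum(-1)       # (M, N)
    dens = g["opacities"][None] * np.exp(-0.5 * mahal)
    return dens.sum(-1)                                # (M,)
```

Starting the particles at measured positions constrains the otherwise unstructured representation, which is the stabilising role the sparse lidar and radar priors play.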
Finally, we study geometry-centric pre-training using occupancy estimation
as a supervision signal and show that while geometric structure yields strong
spatial reasoning, it requires complementary feature separation mechanisms to
achieve semantic discriminability in fine-grained classification tasks.
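The occupancy supervision signal itself is simple to state; the following sketch (grid size, loss choice, and function names are assumptions for illustration) voxelises lidar points into a binary target grid and scores a predicted occupancy volume with binary cross-entropy.

```python
# Illustrative occupancy pre-training target: lidar points -> binary
# voxel grid, compared against predicted occupancy logits.
import numpy as np

def voxelize(points, grid_min, voxel_size, grid_shape):
    """points: (N, 3) -> binary occupancy grid of shape grid_shape."""
    occ = np.zeros(grid_shape, dtype=np.float64)
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    # Keep only points that fall inside the grid.
    valid = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    occ[tuple(idx[valid].T)] = 1.0
    return occ

def occupancy_bce(pred_logits, target, eps=1e-9):
    """Binary cross-entropy between predicted logits and occupancy targets."""
    p = 1.0 / (1.0 + np.exp(-pred_logits))
    return -np.mean(target * np.log(p + eps)
                    + (1.0 - target) * np.log(1.0 - p + eps))
```

A target of this form rewards geometric structure (where is there matter?) but carries no class information, which is consistent with the observation that occupancy pre-training needs complementary mechanisms for fine-grained semantic discrimination.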
Overall, the results suggest that multimodal perception arising from the joint design of fusion strategies, scene representations, and learning objectives can form a robust and scalable foundation for scene understanding in safety-critical automotive applications.
Keywords: Foundation model, Pretraining, ADAS, Bird's-Eye-View, Scene Representation, Lidar, Multimodal Sensor Fusion, AD, 3D Gaussian particles, Radar, Camera, Multimodal Learning
Author
Amer Mustajbasic
Chalmers, Computer Science and Engineering, Data Science and AI
Guided Gaussians: Enhancing 3D Occupancy Estimation with Sparse Sensor Priors
Frontiers in Artificial Intelligence and Applications, Vol. 413 (2025)
Paper in proceedings
SMAB: Simple Multimodal Attention for Effective BEV Fusion
IEEE Intelligent Vehicles Symposium, Proceedings (2025), p. 1766-1772
Paper in proceedings
A. Mustajbasic, S. Chen, E. Stenborg, and Selpi. GeoPriors: Learning Latent 3D Structure via Occupancy Pre-Training for Efficient Multi-Task Scene Understanding
Deep multimodal learning for automotive applications
VINNOVA (2023-00763), 2023-09-01 -- 2027-09-01.
Areas of Advance
Information and Communication Technology
Transport
Subject Categories (SSIF 2025)
Computer Vision and Learning Systems
Computer Science
Signal Processing
Artificial Intelligence
Publisher
Chalmers
Room EA, EDIT building, Rännvägen 6, Chalmers
Opponent: Associate Professor Eren Aksoy, Lund University, Sweden