Learning-Based Sensor Fusion for 3D Scene Understanding in ADAS and AD
Licentiate thesis, 2026

Robust perception is a central requirement for advanced driver assistance
systems (ADAS) and autonomous driving (AD), which must operate reliably
in complex, continuously changing environments. To achieve this, information
from multiple sensors such as cameras, lidar, and radar must be integrated to
exploit their complementary strengths while remaining robust to sensor degradation
and partial observability. This thesis studies learning-based multimodal
perception, focusing on architectural design choices for sensor fusion, spatial
scene representations, and geometry-centric representation learning.

In multimodal perception, a key design question is at what stage information
from different sensors should be fused. We investigate how mid-level fusion
enables effective learning of both modality-specific feature extraction and
cross-modal interaction. In structured spatial representations such as Bird’s-
Eye-View (BEV) grids, we show that attention-based fusion allows models
to dynamically weight sensor contributions depending on context, improving
robustness and scene understanding compared to early feature collapse.
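
As a minimal illustration, attention-based BEV fusion can be realized by
predicting a per-cell weight for each modality and softly mixing the
modality-specific BEV maps. The PyTorch sketch below is hypothetical (all
names are illustrative) and is not the SMAB architecture from the appended
paper; a single scoring head shared across modalities is an assumption of
the sketch.

import torch
import torch.nn as nn

class AttentionBEVFusion(nn.Module):
    # Per-cell soft weighting of modality-specific BEV feature maps
    # (illustrative sketch; one scoring head shared across modalities).
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # one logit per BEV cell

    def forward(self, bev_feats):
        # bev_feats: list of (B, C, H, W) BEV maps, one per sensor.
        logits = torch.stack([self.score(f) for f in bev_feats], dim=1)  # (B, M, 1, H, W)
        weights = torch.softmax(logits, dim=1)   # context-dependent modality weights
        stacked = torch.stack(bev_feats, dim=1)  # (B, M, C, H, W)
        return (weights * stacked).sum(dim=1)    # fused (B, C, H, W)

# Example: fuse camera- and lidar-derived BEV features of matching shape.
fused = AttentionBEVFusion(64)([torch.randn(2, 64, 128, 128),
                                torch.randn(2, 64, 128, 128)])

Because the weights are computed from the features themselves, a degraded
modality (e.g. a rain-obscured camera) can be down-weighted cell by cell
instead of corrupting the fused map globally.
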
While BEV representations provide a convenient fusion space, they impose
fixed spatial discretization and scale poorly to full three-dimensional reasoning.
To address this, we explore probabilistic scene representations based on learnable
3D Gaussian particles, showing that sparse distance measurements from lidar
and radar serve as inductive priors for stable multimodal learning in less
structured scene representations.
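
The sketch below illustrates the core idea in hypothetical form (it is not
the model of the appended Guided Gaussians paper; all names are
illustrative): the means of learnable 3D Gaussian particles are initialized
at sparse lidar/radar returns, so optimization starts from measured geometry
rather than from an unconstrained random layout.

import torch
import torch.nn as nn

class GaussianSceneParticles(nn.Module):
    # Learnable 3D Gaussians whose means start at sparse sensor returns,
    # anchoring an otherwise unstructured representation in measured geometry.
    def __init__(self, sparse_points: torch.Tensor, feat_dim: int = 16):
        super().__init__()
        n = sparse_points.shape[0]
        self.means = nn.Parameter(sparse_points.clone())    # prior: measured 3D points
        self.log_scales = nn.Parameter(torch.zeros(n, 3))   # diagonal covariances
        self.features = nn.Parameter(0.01 * torch.randn(n, feat_dim))

    def density(self, query: torch.Tensor) -> torch.Tensor:
        # query: (Q, 3). Summed Gaussian response at each query point.
        d = query[:, None, :] - self.means[None, :, :]       # (Q, N, 3)
        inv_var = torch.exp(-2.0 * self.log_scales)[None]    # (1, N, 3)
        return torch.exp(-0.5 * (d * d * inv_var).sum(-1)).sum(-1)

points = 10.0 * torch.randn(512, 3)     # stand-in for lidar/radar returns
occupancy = GaussianSceneParticles(points).density(torch.zeros(1, 3))
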

Finally, we study geometry-centric pre-training using occupancy estimation
as a supervision signal and show that while geometric structure yields strong
spatial reasoning, it requires complementary feature separation mechanisms to
achieve semantic discriminability in fine-grained classification tasks.
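
In simplified, hypothetical form (this is not the GeoPriors implementation;
all names are illustrative), occupancy pre-training amounts to decoding a
scene feature into a 3D occupancy volume and supervising it with
lidar-derived targets, so the latent acquires geometric structure without
any semantic labels.

import torch
import torch.nn as nn

class OccupancyPretrainHead(nn.Module):
    # Decode a BEV feature map into per-cell occupancy logits along depth,
    # yielding a dense (D, H, W) occupancy volume.
    def __init__(self, feat_dim: int, depth_bins: int = 16):
        super().__init__()
        self.decode = nn.Conv2d(feat_dim, depth_bins, kernel_size=1)

    def forward(self, bev_feat: torch.Tensor) -> torch.Tensor:
        # bev_feat: (B, C, H, W) -> occupancy logits (B, D, H, W)
        return self.decode(bev_feat)

head = OccupancyPretrainHead(feat_dim=64)
bev = torch.randn(2, 64, 128, 128)                       # fused scene feature
target = (torch.rand(2, 16, 128, 128) > 0.9).float()     # lidar-derived occupancy
loss = nn.functional.binary_cross_entropy_with_logits(head(bev), target)
loss.backward()   # geometric supervision only; no semantic labels required
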
Overall, the results suggest that multimodal perception emerging from the
joint design of fusion strategies, scene representations, and learning
objectives can form a robust and scalable foundation for scene understanding
in safety-critical automotive applications.

Keywords: Foundational model, Pretraining, ADAS, Bird’s-Eye-View, Scene Representation, Lidar, Multimodal Sensor Fusion, AD, 3D Gaussian particles, Radar, Camera, Multimodal Learning

Room EA, EDIT building, Rännvägen 6, Chalmers
Opponent: Associate Professor Eren Aksoy, Lund University, Sweden

Author

Amer Mustajbasic

Chalmers, Computer Science and Engineering, Data Science and AI

Guided Gaussians: Enhancing 3D Occupancy Estimation with Sparse Sensor Priors

Frontiers in Artificial Intelligence and Applications, Vol. 413 (2025)

Paper in proceeding

SMAB: Simple Multimodal Attention for Effective BEV Fusion

IEEE Intelligent Vehicles Symposium, Proceedings (2025), p. 1766-1772

Paper in proceeding

A. Mustajbasic, S. Chen, E. Stenborg, and Selpi. GeoPriors: Learning Latent 3D Structure via Occupancy Pre-Training for Efficient Multi-Task Scene Understanding

Deep MultiModal Learning for Automotive Applications

VINNOVA (2023-00763), 2023-09-01 -- 2027-09-01.

Areas of Advance

Information and Communication Technology

Transport

Subject Categories (SSIF 2025)

Computer Vision and Learning Systems

Computer Sciences

Signal Processing

Artificial Intelligence

Publisher

Chalmers


More information

Latest update

4/8/2026