Leveraging joint vision-language models to improve diagnostic accuracy and disease localization in medical imaging
Research Project, 2026–2030
This project develops advanced multimodal vision–language models for medical image analysis to address challenges in diagnostic accuracy, disease localization, prediction confidence, and visual understanding. By integrating privileged text from radiology reports with imaging data during training, our approach enables models to learn robust representations for real-time, image-only inference, and supports disease localization without relying on labor-intensive pixel-level annotations. We employ state-of-the-art techniques, including contrastive learning, adapted text encoding, and fusion strategies, to harness the complementary strengths of images, text, and tabular treatment data. This integration not only enhances diagnostic performance and confidence calibration but also supports weakly supervised localization and improves generalization across diverse populations. Our team, comprising researchers at Chalmers and NYU, applies this general methodology to the evaluation of lymphoma in positron emission tomography (PET) imaging. Over the five-year period, our research will initially focus on refining image–text alignment for accurate diagnosis and localization, later expanding to incorporate treatment data for deeper clinical insights and explainable reasoning. Our work aims to increase accuracy, reduce false positives and unnecessary testing, and ultimately improve patient outcomes, while advancing multimodal domain adaptation and weak supervision methodologies in medical AI.
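To make the image–text alignment idea concrete, the sketch below shows a CLIP-style contrastive objective that pulls paired image and report embeddings together in a shared space. It is a minimal, illustrative example assuming pre-extracted encoder features; the ContrastiveAligner class, feature dimensions, and temperature are hypothetical placeholders and do not represent the project's actual models or training setup.

```python
# Minimal sketch of contrastive image-text alignment (illustrative only;
# the encoders, dimensions, and hyperparameters are placeholders).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAligner(nn.Module):
    def __init__(self, image_dim=512, text_dim=768, embed_dim=256, temperature=0.07):
        super().__init__()
        # Linear projections map image and report features into a shared embedding space.
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.temperature = temperature

    def forward(self, image_feats, text_feats):
        # Normalize projected embeddings so dot products are cosine similarities.
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        # Pairwise similarity matrix: image i vs. report j.
        logits = img @ txt.t() / self.temperature
        # Matching image-report pairs lie on the diagonal.
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE loss over image-to-text and text-to-image directions.
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    # Random features stand in for pooled PET image and radiology report encoder outputs.
    aligner = ContrastiveAligner()
    image_feats = torch.randn(8, 512)
    text_feats = torch.randn(8, 768)
    print(aligner(image_feats, text_feats).item())
```

In a privileged-information setup of this kind, the report text is only needed to compute the training loss; at inference time the image branch can be used on its own, which is what permits image-only deployment.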
Participants
Ida Häggström (contact)
Chalmers, Electrical Engineering, Signal Processing and Biomedical Engineering
Funding
Swedish Research Council (VR)
Project ID: 2025-05231
Funding Chalmers participation during 2026–2030