Applying machine learning to high-dimensional proteomics datasets for the identification of Alzheimer’s disease biomarkers

Christoffer Ivarsson Orrelid; Oscar Rosberg; Sophia Weiner; Fredrik Johansson; Johan Gobom; Henrik Zetterberg; Newton Mwai Kinyanjui; Lena Stempfle

doi:10.1186/s12987-025-00634-z

Journal article, 2025

Purpose: This study explores the application of machine learning to high-dimensional proteomics datasets for identifying Alzheimer’s disease (AD) biomarkers. AD, a neurodegenerative disorder affecting millions worldwide, requires early and accurate diagnosis for effective management.
Methods: We leverage Tandem Mass Tag (TMT) proteomics data from cerebrospinal fluid (CSF) samples from the frontal cortex of patients with idiopathic normal pressure hydrocephalus (iNPH), a condition often comorbid with AD, with rare access to both lumbar and ventricular samples. Our methodology includes extensive data preprocessing to address batch effects and missing values, followed by the Synthetic Minority Over-sampling Technique (SMOTE) for data augmentation to overcome the small sample size. We apply linear models, non-linear models, and ensemble methods to a classification task that compares iNPH patients with and without biomarker evidence of AD pathology (Aβ-T- or Aβ+T+).
Results: We present a machine learning workflow for high-dimensional TMT proteomics data that addresses the inherent characteristics of such data. Our results demonstrate that batch effect correction has little or no impact on model performance, and that robust feature selection is critical for model stability and performance, especially in the high-dimensional proteomics setting for AD diagnostics. The results further indicate that removing features with missing values produced stronger models than imputing them. Our best-performing disease-progression detection model, a random forest, achieves an AUC of 0.84 (± 0.03).
Conclusion: We identify several novel protein biomarker candidates, such as FABP3 and GOT1, with potential diagnostic value for detecting AD pathology. This suggests that different biomarkers may be needed for diagnosing AD in patients with iNPH, and that different biomarkers should be considered for ventricular and lumbar CSF samples. This work underscores the importance of a meticulous machine learning process in enhancing biomarker discovery. Our study also provides insights into translating biomarkers from other central nervous system diseases like iNPH, and into using both ventricular and lumbar CSF samples for biomarker discovery, providing a foundation for future research and clinical applications.
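
The two preprocessing ideas named in the abstract, removing features with missing values and SMOTE oversampling, can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' pipeline: the toy data, the function names `drop_incomplete_features` and `smote`, and the parameters `k` and `seed` are all assumptions made for illustration. In practice a maintained implementation, such as the one in the imbalanced-learn library, would normally be used, and oversampling would be applied only to training folds so that synthetic samples do not leak into evaluation.

```python
import math
import random


def drop_incomplete_features(rows):
    """Keep only the columns that have no missing value (None) in any row."""
    n_features = len(rows[0])
    keep = [j for j in range(n_features)
            if all(row[j] is not None for row in rows)]
    return [[row[j] for j in keep] for row in rows], keep


def smote(minority, n_new, k=3, seed=0):
    """Basic SMOTE: build n_new synthetic samples, each interpolated between
    a random minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority samples, nearest first.
        others = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 6 samples x 4 protein abundances, None = missing.
X = [
    [1.0, 2.0, None, 0.5],
    [1.2, 1.8, 0.3, 0.4],
    [0.9, 2.1, 0.2, 0.6],
    [1.1, 2.2, 0.4, 0.5],
    [3.0, 0.5, 0.9, 1.5],
    [3.2, 0.7, None, 1.7],
]
y = [0, 0, 0, 0, 1, 1]  # 1 = minority class

# Step 1: remove features with missing values instead of imputing them.
X_complete, kept = drop_incomplete_features(X)

# Step 2: oversample the minority class until both classes are the same size.
minority = [x for x, label in zip(X_complete, y) if label == 1]
n_new = y.count(0) - y.count(1)
synthetic = smote(minority, n_new=n_new)

X_balanced = X_complete + synthetic
y_balanced = y + [1] * n_new
```

Each synthetic sample lies on a line segment between two real minority samples, so it stays within the range already observed for that class; it adds no new biological information, which is one reason to keep it out of any held-out evaluation data.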

Feature selection

High-dimensional data

Mass spectrometry

Biomarkers

Alzheimer’s disease

Machine learning

Proteomics

Authors

Christoffer Ivarsson Orrelid

Student at Chalmers

Oscar Rosberg

Student at Chalmers

Sophia Weiner

University of Gothenburg

Fredrik Johansson

Data Science and AI 3

Johan Gobom

University of Gothenburg

Sahlgrenska University Hospital

Henrik Zetterberg

Sahlgrenska University Hospital

University College London (UCL)

Hong Kong Center for Neurodegenerative Diseases

University of Gothenburg

University of Wisconsin

Newton Mwai Kinyanjui

Chalmers, Computer Science and Engineering, Data Science and AI

Lena Stempfle

Chalmers, Computer Science and Engineering, Data Science and AI

Fluids and Barriers of the CNS

20458118 (eISSN)

Vol. 22, Issue 1, Article 23

Subject categories (SSIF 2025)

Bioinformatics (Computational Biology)

Bioinformatics and Computational Biology

DOI

10.1186/s12987-025-00634-z

PubMed

40033432

More information

Last updated

2026-02-09