PROTAC-Splitter: a machine learning framework for automated identification of PROTAC substructures
Journal article, 2026

Proteolysis-targeting chimeras (PROTACs) are heterobifunctional molecules composed of an E3 ligase ligand, a linker, and a warhead targeting a protein of interest. Despite this modular structure, accurately identifying and annotating these components is challenging and typically relies on manual curation and predefined substructure matching. To automate this step, we developed PROTAC-Splitter, a machine learning framework for annotating PROTAC substructures. To address data scarcity, we generated and openly released a synthetic dataset containing approximately 1.3 million PROTAC structures with annotated ligand splits. Leveraging this dataset, we developed two complementary approaches for PROTAC substructure annotation: a Transformer-based sequence-to-sequence model and a graph-based XGBoost model. We evaluated both approaches on held-out public data and on structurally novel PROTACs from AstraZeneca’s proprietary collection. The Transformer-based model achieved high exact-match accuracy (86%) on public data, but its accuracy dropped to 18% on structurally novel internal PROTACs due to occasional hallucinations. In contrast, the XGBoost model ensures chemical validity and perfect reassembly accuracy on both sets, with lower exact-match accuracy on public data (42.2%) but comparable performance on the internal set (23%). To improve reliability, we implemented a wrapper function for the Transformer (Transformer-Δ) that corrects partial prediction errors, raising reassembly accuracy to 96% on the public dataset and 70% on the internal dataset. Combining the strengths of both models, we propose a hybrid approach that reliably annotates PROTACs across diverse chemical spaces. PROTAC-Splitter provides a robust, scalable tool to facilitate automated PROTAC analysis and is available open-source at https://github.com/ribesstefano/PROTAC-Splitter
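The abstract reports two different metrics, exact-match accuracy and reassembly accuracy, and the difference between them may be easier to see in a small sketch. The code below is only an illustration under stated assumptions, not the authors' evaluation code: the `Split` container, both function names, and the toy `join_parts` function are hypothetical, and the strings are placeholder tokens rather than real SMILES. A real evaluation would compare canonicalized structures with a cheminformatics toolkit.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Split:
    """Hypothetical container for one annotated PROTAC split."""
    e3_ligand: str
    linker: str
    warhead: str


def exact_match_accuracy(predicted: Sequence[Split], reference: Sequence[Split]) -> float:
    """Fraction of PROTACs whose three predicted parts all equal the reference parts."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


def reassembly_accuracy(
    protacs: Sequence[str],
    predicted: Sequence[Split],
    reassemble: Callable[[Split], str],
) -> float:
    """Fraction of predictions whose parts rejoin into the original PROTAC."""
    if len(protacs) != len(predicted):
        raise ValueError("protacs and predicted must have the same length")
    if not protacs:
        return 0.0
    hits = sum(reassemble(p) == mol for mol, p in zip(protacs, predicted))
    return hits / len(protacs)


def join_parts(split: Split) -> str:
    """Toy stand-in for chemical reassembly: plain string concatenation (assumption)."""
    return split.e3_ligand + split.linker + split.warhead


# Placeholder tokens, not real chemistry.
protacs = ["E3-LINK-WH", "E3x-LNK-WHx"]
reference = [Split("E3-", "LINK", "-WH"), Split("E3x-", "LNK", "-WHx")]
# The second prediction cuts the molecule at different positions than the reference.
predicted = [Split("E3-", "LINK", "-WH"), Split("E3x", "-LNK-", "WHx")]

exact = exact_match_accuracy(predicted, reference)
reassembled = reassembly_accuracy(protacs, predicted, join_parts)
```

In this toy case the second prediction places the cut points differently from the reference, so it counts as a miss for exact match, yet its parts still rejoin into the original string. This is why a method can score perfectly on reassembly while scoring lower on exact match.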

Cheminformatics

Drug discovery

Targeted protein degradation

Machine learning

PROTAC

Authors

Stefano Ribes

University of Gothenburg

Chalmers, Computer Science and Engineering, Data Science and AI

Ranxuan Zhang

Student at Chalmers

Télio Corentin Cropsal

Chalmers, Computer Science and Engineering, Data Science and AI

University of Gothenburg

Andreas Källberg

Software Engineering, Group C2

Christian Tyrchan

AstraZeneca AB

Eva Nittinger

AstraZeneca AB

Rocio Mercado

University of Gothenburg

Chalmers, Computer Science and Engineering, Data Science and AI

Journal of Cheminformatics

1758-2946 (ISSN) 17582946 (eISSN)

Vol. 18, Issue 1, 30

Subject categories (SSIF 2025)

Bioinformatics (Computational Biology)

Physical Chemistry

DOI

10.1186/s13321-025-01135-9

PubMed

41721433

More information

Last updated

2026-02-27