Efficient decoy selection to improve virtual screening using machine learning models
Journal article, 2025

Machine learning models using protein-ligand interaction fingerprints show promise as target-specific scoring functions in drug discovery, but their performance depends critically on how decoys are selected. We therefore analyzed several decoy selection strategies for machine learning models based on the Protein per Atom Score Contributions Derived Interaction Fingerprint (PADIF). We explored three workflows for decoy selection: (1) random selection from large databases such as ZINC15, (2) using recurrent non-binders from high-throughput screening (HTS) assays, stored as dark chemical matter, and (3) data augmentation using diverse conformations from docking results. Active molecules from ChEMBL, combined with decoys from each approach, were used to train and test different PADIF-based machine learning models. Final validation used experimentally determined inactive compounds from the LIT-PCBA dataset. Our findings show that models trained with random selections from ZINC15 or with dark chemical matter compounds closely match the performance of models trained with actual non-binders, making them viable alternatives for building accurate models when specific inactivity data are unavailable. Furthermore, all models showed an enhanced ability to explore new chemical space for their specific target and improved the selection of top-ranked active compounds over classical scoring functions, thereby boosting screening power in molecular docking. These findings demonstrate that appropriate decoy selection strategies can maintain model accuracy while extending applicability to targets that lack extensive experimental data.
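
As a rough illustration of the first workflow (random decoy selection from a large compound pool), the following minimal sketch assembles a labeled training set from a list of actives and a decoy pool. It is not the authors' implementation: the function name `build_training_set`, the `decoys_per_active` ratio of 4, and the placeholder molecule identifiers are all assumptions made for this example, and the fingerprint calculation and model training steps are omitted.

```python
import random


def build_training_set(actives, decoy_pool, decoys_per_active=4, seed=0):
    """Return (molecule_id, label) pairs: actives get 1, sampled decoys get 0.

    Hypothetical helper for illustration only; the ratio and seed are
    assumed values, not taken from the article.
    """
    rng = random.Random(seed)
    # Drop pool entries that are known actives so no molecule gets both labels.
    candidates = sorted(set(decoy_pool) - set(actives))
    # Never request more decoys than the pool can supply.
    n_decoys = min(len(candidates), decoys_per_active * len(actives))
    decoys = rng.sample(candidates, n_decoys)
    return [(m, 1) for m in actives] + [(m, 0) for m in decoys]


# Placeholder identifiers standing in for ChEMBL actives and a ZINC15-like pool.
actives = ["active_1", "active_2"]
pool = [f"zinc_{i}" for i in range(100)] + ["active_1"]
dataset = build_training_set(actives, pool)
```

The same labeled list could then be passed to a fingerprint calculation and a classifier. The other two workflows would differ only in where the label-0 entries come from (dark chemical matter compounds, or additional docking poses).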

PADIF

Protein-ligand interaction fingerprint

Molecular docking

Virtual screening

Specific scoring function

Decoys

Authors

Felipe Victoria-Muñoz

Universität Münster

Janosch Menke

Chalmers, Computer Science and Engineering, Data Science and AI

Universität Münster

University of Gothenburg

Norberto Sanchez-Cruz

Universidad Nacional Autónoma de México

Oliver Koch

Universität Münster

Journal of Cheminformatics

1758-2946 (ISSN) 17582946 (eISSN)

Vol. 17, issue 1, article 165

Subject categories (SSIF 2025)

Bioinformatics (Computational Biology)

DOI

10.1186/s13321-025-01107-z

Related datasets

PADIF Decoys Test [dataset]

URI: https://github.com/kochgroup/PADIF-wf

More information

Last updated

2025-11-07