Efficient decoy selection to improve virtual screening using machine learning models
Journal article, 2025

Machine learning models based on protein-ligand interaction fingerprints show promise as target-specific scoring functions in drug discovery, but their performance depends critically on the decoy selection strategy used to train them. We therefore analyzed several decoy selection strategies for machine learning models based on the Protein per Atom Score Contributions Derived Interaction Fingerprint (PADIF). We explored three workflows for decoy selection: (1) random selection from large databases such as ZINC15, (2) use of recurrent non-binders from high-throughput screening (HTS) assays, known as dark chemical matter, and (3) data augmentation using diverse conformations from docking results. Active molecules from ChEMBL, combined with decoys from each approach, were used to train and test different PADIF-based machine learning models. Final validation used experimentally determined inactive compounds from the LIT-PCBA dataset. Our findings reveal that models trained with random selections from ZINC15 or with dark chemical matter compounds closely mimic the performance of models trained with actual non-binders, offering viable alternatives for building accurate models when specific inactivity data are unavailable. Furthermore, all models showed an enhanced ability to explore new chemical space for their specific target and improved the selection of top-ranked active compounds over classical scoring functions, thereby boosting the screening power of molecular docking. These findings demonstrate that appropriate decoy selection strategies can maintain model accuracy while extending applicability to targets that lack extensive experimental data.
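As a rough illustration of workflow (1), the sketch below draws random decoys from a candidate pool and labels them alongside known actives. It is not the authors' implementation: the function names, the 4:1 decoy-to-active ratio, the fixed seed, and the placeholder molecule identifiers are all assumptions made for this example, and a real pipeline would work on docked poses and PADIF vectors rather than plain strings.

```python
import random

def select_random_decoys(actives, candidate_pool, ratio=4, seed=0):
    """Draw up to `ratio` decoys per active from candidates not known to be active.

    `ratio` and `seed` are illustrative defaults, not values from the article.
    """
    active_set = set(actives)
    # Deduplicate while preserving order, and drop anything already labeled active.
    eligible = [c for c in dict.fromkeys(candidate_pool) if c not in active_set]
    n_decoys = min(len(eligible), ratio * len(active_set))
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return rng.sample(eligible, n_decoys)

def build_training_set(actives, decoys):
    """Label actives as 1 and decoys as 0 for a binary classifier."""
    return [(m, 1) for m in actives] + [(m, 0) for m in decoys]

# Hypothetical identifiers standing in for ChEMBL actives and a ZINC15 subset.
actives = ["active_1", "active_2"]
pool = [f"zinc_{i}" for i in range(20)] + ["active_1"]
decoys = select_random_decoys(actives, pool, ratio=4, seed=42)
dataset = build_training_set(actives, decoys)
```

Excluding known actives from the pool before sampling keeps an active from being mislabeled as a decoy, and seeding the generator makes the decoy set repeatable across runs.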

PADIF

Protein-ligand interaction fingerprint

Molecular docking

Virtual screening

Specific scoring function

Decoys

Authors

Felipe Victoria-Muñoz

University of Münster

Janosch Menke

Chalmers, Computer Science and Engineering (Chalmers), Data Science and AI

University of Münster

University of Gothenburg

Norberto Sanchez-Cruz

Universidad Nacional Autónoma de México

Oliver Koch

University of Münster

Journal of Cheminformatics

1758-2946 (ISSN), 1758-2946 (eISSN)

Vol. 17, Issue 1, Article 165

Subject Categories (SSIF 2025)

Bioinformatics (Computational Biology)

DOI

10.1186/s13321-025-01107-z

Related datasets

PADIF Decoys Test [dataset]

URI: https://github.com/kochgroup/PADIF-wf

More information

Latest update

11/7/2025