A data science roadmap for open science organizations engaged in early-stage drug discovery
Journal article, 2024

The Structural Genomics Consortium is an international open science research organization with a focus on accelerating early-stage drug discovery, namely hit discovery and optimization. Like many others, we believe that artificial intelligence (AI) is poised to be a main accelerator in the field. The question is then how best to benefit from recent advances in AI, and how to generate, format and disseminate data to enable future breakthroughs in AI-guided drug discovery. We present here the recommendations of a working group composed of experts from both the public and private sectors. Robust data management requires precise ontologies and standardized vocabulary, while a centralized database architecture across laboratories facilitates data integration into high-value datasets. Lab automation and opening electronic lab notebooks to data mining push the boundaries of data sharing and data modeling. Important considerations for building robust machine-learning models include transparent and reproducible data processing, choosing the most relevant data representation, defining the right training and test sets, and estimating prediction uncertainty. Beyond data sharing, cloud-based computing can be harnessed to build and disseminate machine-learning models. Important vectors of acceleration for hit and chemical probe discovery will be (1) the real-time integration of experimental data generation and modeling workflows within design-make-test-analyze (DMTA) cycles, openly and at scale, and (2) the adoption of a mindset where data scientists and experimentalists work as a unified team, and where data science is incorporated into the experimental design.

Authors

Kristina Edfeldt

Karolinska University Hospital

Aled M. Edwards

University of Toronto

Ola Engkvist

Chalmers, Computer Science and Engineering

Judith Günther

Bayer AG

Matthew Hartley

European Bioinformatics Institute

David G. Hulcoop

Open Targets

European Bioinformatics Institute

Andrew R. Leach

European Bioinformatics Institute

Brian D. Marsden

University of Oxford

Amelie Menge

Johann Wolfgang Goethe Universität Frankfurt am Main

Leonie Misquitta

National Library of Medicine (NLM)

Susanne Müller

Johann Wolfgang Goethe Universität Frankfurt am Main

Dafydd R. Owen

Pfizer

Kristof T. Schütt

Pfizer

Nicholas Skelton

Genentech

Andreas Steffen

Pfizer

Alexander Tropsha

University of North Carolina at Chapel Hill

Erik Vernet

Novo Nordisk

Yanli Wang

National Library of Medicine (NLM)

James Wellnitz

University of North Carolina at Chapel Hill

Timothy M. Willson

University of North Carolina at Chapel Hill

Djork Arné Clevert

Pfizer

Benjamin Haibe-Kains

Vector Institute for AI

University Health Network

University of Toronto

Lovisa Holmberg Schiavone

AstraZeneca AB

Matthieu Schapira

University of Toronto

Nature Communications

2041-1723 (ISSN) 20411723 (eISSN)

Vol. 15, issue 1, article 5640

Subject categories (SSIF 2025)

Bioinformatics (Computational Biology)

Computer Sciences

DOI

10.1038/s41467-024-49777-x

More information

Last updated

2025-05-16