Do Chemformers Dream of Organic Matter? Evaluating a Transformer Model for Multistep Retrosynthesis
Journal article, 2024

Synthesis planning of new pharmaceutical compounds is a well-known bottleneck in modern drug design. Template-free methods, such as transformers, have recently been proposed as an alternative to template-based methods for single-step retrosynthetic predictions. Here, we trained and evaluated a transformer model, called the Chemformer, for retrosynthesis predictions within drug discovery. The proprietary data set used for training comprised ∼18 M reactions from literature, patents, and electronic lab notebooks. Chemformer was evaluated for both single-step and multistep retrosynthesis. We found that the single-step performance of Chemformer was especially good on reaction classes common in drug discovery, with most reaction classes showing a top-10 round-trip accuracy above 0.97. Moreover, Chemformer reached a higher round-trip accuracy than a template-based model. By analyzing multistep retrosynthesis experiments, we observed that Chemformer found synthetic routes leading to commercial starting materials for 95% of the target compounds, an increase of more than 20% compared to the template-based model on a proprietary compound data set. In addition, we found that Chemformer suggested novel disconnections corresponding to reaction templates that are not included in the template-based model. These findings were further supported by a publicly available ChEMBL compound data set. The conclusions drawn from this work allow for the design of a synthesis planning tool where template-based and template-free models work in harmony to optimize retrosynthetic recommendations.

Authors

Annie M. Westerlund

AstraZeneca AB

Siva Manohar Koki

AstraZeneca AB

Supriya Kancharla

AstraZeneca AB

Alessandro Tibo

AstraZeneca AB

Lakshidaa Saigiridharan

AstraZeneca AB

Mikhail Kabeshov

AstraZeneca AB

Rocio Mercado

Chalmers, Computer Science and Engineering, Data Science and AI

Samuel Genheden

AstraZeneca AB

Journal of Chemical Information and Modeling

1549-9596 (ISSN) 1549-960X (eISSN)

Vol. 64, Issue 8, pp. 3021-3033

Subject categories

Medical Engineering

Organic Chemistry

DOI

10.1021/acs.jcim.3c01685

PubMed

38602390

More information

Last updated

2024-05-11