MolEncoder: towards optimal masked language modeling for molecules
Journal article, 2025

Predicting molecular properties is a key challenge in drug discovery. Machine learning models, especially those based on transformer architectures, are increasingly used to make these predictions from chemical structures. Inspired by recent progress in natural language processing, many studies have adopted encoder-only transformer architectures similar to BERT (Bidirectional Encoder Representations from Transformers) for this task. These models are pretrained using masked language modeling, where parts of the input are hidden and the model learns to recover them before fine-tuning on downstream tasks. In this work, we systematically investigate whether core assumptions from natural language processing, which are commonly adopted in molecular BERT-based models, actually hold when applied to molecules represented using the Simplified Molecular Input Line Entry System (SMILES). Specifically, we examine how masking ratio, pretraining dataset size, and model size affect performance in molecular property prediction. We find that higher masking ratios than commonly used significantly improve performance. In contrast, increasing model or pretraining dataset size quickly leads to diminishing returns, offering no consistent benefit while incurring significantly higher computational cost. Based on these insights, we develop MolEncoder, a BERT-based model that outperforms existing approaches on drug discovery tasks while being more computationally efficient. Our results highlight key differences between molecular pretraining and natural language processing, showing that they require different design choices. This enables more efficient model development and lowers barriers for researchers with limited computational resources. We release MolEncoder publicly to support future work and hope our findings help make molecular representation learning more accessible and cost-effective in drug discovery.
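The abstract's central technical point is that the masking ratio used during masked language modeling on SMILES should be higher than the value conventionally carried over from natural language processing. Below is a minimal sketch of such a pretraining setup using the Hugging Face transformers library; it is not the authors' released MolEncoder code. The tokenizer checkpoint "seyonec/ChemBERTa-zinc-base-v1", the small model configuration, and the masking probability of 0.45 are illustrative assumptions, not the paper's reported architecture or optimum.

```python
# Sketch only: masked language modeling on SMILES with a configurable masking
# ratio, not the authors' MolEncoder implementation.
from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
)

# Assumed tokenizer checkpoint for illustration; MolEncoder's tokenizer may differ.
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

# Compact encoder-only model: the abstract reports diminishing returns from
# scaling model size, so a small configuration is used here (values are assumptions).
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=8,
)
model = BertForMaskedLM(config)

# The key knob studied in the paper: the masking ratio. The NLP convention is
# 0.15; the abstract finds higher ratios work better for SMILES. 0.45 is an
# illustrative value, not the paper's reported optimum.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.45
)

# A few example SMILES strings (ethanol, benzene, aspirin).
smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
batch = collator([tokenizer(s) for s in smiles])

# Forward pass: the collator supplies masked inputs and labels, and the model
# returns the masked-token reconstruction loss used for pretraining.
outputs = model(**batch)
print(outputs.loss)
```

In a full pretraining run, this loss would be minimized over a large SMILES corpus before fine-tuning the encoder on downstream property-prediction tasks, with the masking ratio treated as a tunable hyperparameter rather than fixed at the NLP default.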

Authors

Fabian P. Kruger

AstraZeneca AB

Helmholtz Munich

Technical University of Munich

Nicklas Osterbacka

AstraZeneca AB

Mikhail Kabeshov

AstraZeneca AB

Ola Engkvist

Chalmers University of Technology, Computer Science and Engineering, Data Science and AI

University of Gothenburg

Igor Tetko

Helmholtz Munich

Digital Discovery

2635-098X (eISSN)

Vol. In Press

Subject Categories (SSIF 2025)

Bioinformatics (Computational Biology)

Computer Sciences

DOI

10.1039/d5dd00369e
