MolEncoder: Improved Masked Language Modeling for Molecules
Paper in proceedings, 2026
Predicting molecular properties is an important challenge in drug discovery. Machine learning methods, particularly those based on transformer architectures, have become increasingly popular for this task by learning molecular representations directly from chemical structure [1, 2]. Motivated by progress in natural language processing, many recent approaches apply models based on the BERT (Bidirectional Encoder Representations from Transformers) architecture [3] to molecular data using SMILES as the input format [4–9]. In this study, we revisit core design assumptions that originate in natural language processing but are often carried over to molecular tasks without modification. We explore how variations in masking strategies, pretraining dataset size, and model size influence downstream performance in molecular property prediction. Our findings suggest that common practices inherited from natural language processing do not always yield optimal results in this setting. In particular, we observe that increasing the masking ratio can lead to significant improvements, while scaling up the model or dataset size yields stagnating gains despite higher computational cost (Fig. 1). Building on these observations, we develop MolEncoder, a BERT-style model that achieves improved performance on standard benchmarks while remaining more efficient than existing approaches. These insights highlight meaningful differences between molecular and textual learning settings. By identifying design choices better suited to chemical data, we aim to support more effective and efficient model development for researchers working in drug discovery and related fields.
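To make the masking-ratio discussion concrete, the sketch below shows a generic BERT-style token-masking routine applied to SMILES tokens. It is a minimal illustration only: the `mask_tokens` helper, its default ratio, the small fallback vocabulary, and the 80/10/10 replacement scheme are assumptions borrowed from standard masked language modeling practice, not MolEncoder's published pretraining procedure.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_ratio=0.30, vocab=None, seed=None):
    """Select a fraction of SMILES tokens as masked-language-modeling targets.

    Illustrative sketch: the 30% default and the BERT-style 80/10/10
    replacement scheme are assumptions, not MolEncoder's exact settings.
    """
    rng = random.Random(seed)
    vocab = vocab or ["C", "c", "N", "O", "=", "(", ")", "1", "2"]
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_ratio:
            labels[i] = tok                       # model must predict the original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_TOKEN            # 80%: replace with the mask token
            elif r < 0.9:
                inputs[i] = rng.choice(vocab)     # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

# Example: character-level tokens of the SMILES string for ethanol ("CCO"),
# masked at a higher-than-usual ratio to mimic the aggressive masking studied here.
masked, targets = mask_tokens(list("CCO"), mask_ratio=0.5, seed=0)
```

Raising `mask_ratio` in such a routine increases the fraction of positions that contribute to the pretraining loss, which is the knob the study varies when comparing masking strategies.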