Transformational machine learning: Learning how to learn from many related scientific problems
Journal article, 2021

Almost all machine learning (ML) is based on representing examples using intrinsic features. When there are multiple related ML problems (tasks), it is possible to transform these features into extrinsic features by first training ML models on other tasks and letting them each make predictions for each example of the new task, yielding a novel representation. We call this transformational ML (TML). TML is very closely related to, and synergistic with, transfer learning, multitask learning, and stacking. TML is applicable to improving any nonlinear ML method. We tested TML using the most important classes of nonlinear ML: random forests, gradient boosting machines, support vector machines, k-nearest neighbors, and neural networks. To ensure the generality and robustness of the evaluation, we utilized thousands of ML problems from three scientific domains: drug design, predicting gene expression, and ML algorithm selection. We found that TML significantly improved the predictive performance of all the ML methods in all the domains (4 to 50% average improvements) and that TML features generally outperformed intrinsic features. Use of TML also enhances scientific understanding through explainable ML. In drug design, we found that TML provided insight into drug target specificity, the relationships between drugs, and the relationships between target proteins. TML leads to an ecosystem-based approach to ML, where new tasks, examples, predictions, and so on synergistically interact to improve performance. To contribute to this ecosystem, all our data, code, and our ∼50,000 ML models have been fully annotated with metadata, linked, and openly published using Findability, Accessibility, Interoperability, and Reusability principles (∼100 Gbytes).
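The feature transformation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes small synthetic regression tasks (the hypothetical `make_task` helper) and uses a hand-written k-nearest-neighbors predictor, one of the nonlinear methods named above, so that it runs on the Python standard library alone. All function names, task weights, and data sizes are illustrative assumptions.

```python
import math
import random


def knn_predict(train_X, train_y, x, k=3):
    """Predict by averaging the targets of the k nearest training examples."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def make_task(rng, weights, n=40):
    """Hypothetical synthetic task: a nonlinear function of two intrinsic features."""
    X = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    y = [math.sin(weights[0] * a) + weights[1] * b * b for a, b in X]
    return X, y


def to_extrinsic(prior_tasks, X, k=3):
    """Represent each example by the predictions of every prior-task model."""
    return [tuple(knn_predict(task_X, task_y, x, k) for task_X, task_y in prior_tasks)
            for x in X]


rng = random.Random(0)

# Related tasks that already have training data (assumed weights, for illustration).
prior_tasks = [make_task(rng, w)
               for w in [(1.0, 0.5), (2.0, -1.0), (0.5, 2.0), (3.0, 1.0)]]

# The new task, described at first by its two intrinsic features.
new_X, new_y = make_task(rng, (1.5, 1.0), n=30)

# Transformation step: one extrinsic feature per prior-task model.
extrinsic_X = to_extrinsic(prior_tasks, new_X)

# Train on the first 20 transformed examples and predict the remaining 10.
train_E, test_E = extrinsic_X[:20], extrinsic_X[20:]
train_y, test_y = new_y[:20], new_y[20:]
predictions = [knn_predict(train_E, train_y, e) for e in test_E]
```

In this sketch the prior-task models never see the new task's targets, so the extrinsic representation can be computed for unseen examples of the new task as well. Whether it outperforms the intrinsic features depends on how related the tasks are; the sketch makes no claim about that.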

Drug design

Stacking

Transfer learning

Multitask learning

AI

Authors

Ivan Olier

Liverpool John Moores University

Oghenejokpeme I. Orhobor

University of Cambridge

Tirtharaj Dash

Birla Institute of Technology and Science Pilani

Andy M. Davis

AstraZeneca AB

Larisa N. Soldatova

Goldsmiths, University of London

Joaquin Vanschoren

Technische Universiteit Eindhoven

Ross King

Chalmers, Biology and Biological Engineering, Systems Biology

University of Cambridge

Alan Turing Institute

Proceedings of the National Academy of Sciences of the United States of America

0027-8424 (ISSN) 1091-6490 (eISSN)

Vol. 118, No. 49, e2108013118

Subject categories

Bioinformatics (Computational Biology)

Bioinformatics and Systems Biology

Computer Science

DOI

10.1073/pnas.2108013118

PubMed

34845013

More information

Last updated

2021-12-30