Application of machine learning in systems biology
Doctoral thesis, 2020

Biological systems are composed of a large number of molecular components. One of the aims of systems biology is to understand the behavior of such systems as a result of the interactions between the individual components. Computational modelling is a powerful tool commonly used in systems biology. It relies on mathematical models that capture the properties of the molecular components and the interactions between them in order to simulate the behavior of the whole system. However, for many biological systems it is challenging to build reliable mathematical models because the underlying mechanisms are complex and poorly understood. With the breakthrough in big data technologies in biology, data-driven machine learning (ML) approaches offer a promising complement to traditional theory-based models in systems biology. Firstly, ML can be used to model systems in which the relationships between the components and the system are too complex to be captured by theory-based models. Two such examples of using ML to resolve genotype-phenotype relationships are presented in this thesis: (i) predicting yeast phenotypes from genomic features and (ii) predicting the thermal niche of microorganisms from proteome features. Secondly, ML naturally complements theory-based models. By applying ML, I improved the performance of a genome-scale metabolic model in describing yeast thermotolerance. In this application, the thermal parameters were estimated with a Bayesian statistical learning approach that trains regression models and performs uncertainty quantification and reduction. The bottleneck genes predicted by the model were further validated by experiments aimed at improving yeast thermotolerance.

In such applications, regression models are frequently used, and their performance depends on many factors, including but not limited to feature engineering and the quality of the response values. Manually engineering a sufficient set of relevant features is particularly challenging in biology because knowledge is lacking in certain areas. With the increasing volume of big data, deep transfer learning makes it possible to learn a statistical summary of the samples from a large dataset, which can then be used as input to train other ML models. In the present thesis, I applied this approach to first learn a deep representation of enzyme thermal adaptation and then use it to develop regression models for predicting enzyme optimal temperatures and protein melting temperatures. In both cases, the transfer learning-based regression models outperformed the classical ones trained on rationally engineered features. Noisy response values, on the other hand, are very common in biological datasets due to variation in experimental measurements, and they fundamentally restrict the performance attainable with regression models. I therefore addressed this challenge by deriving a theoretical upper bound for the coefficient of determination (R²) of regression models. This upper bound depends on the noise associated with the response variable and on the variance of the response values in a given dataset. It can thus be used to test whether the maximal performance has been reached on a particular dataset, or whether further model improvement is possible.
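The idea that measurement noise caps the attainable coefficient of determination can be illustrated with a short simulation. The sketch below is not the derivation from the thesis. It assumes additive, zero-mean noise with a known standard deviation that is independent of the true values, in which case even a model that predicts the noise-free values exactly cannot be expected to score above roughly 1 - (noise variance) / (variance of the observed responses). The function names, the Gaussian data and the parameter values are all hypothetical choices made for illustration.

```python
import numpy as np


def r2_score(y_obs, y_pred):
    """Coefficient of determination of predictions against observed values."""
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - np.mean(y_obs)) ** 2)
    return 1.0 - ss_res / ss_tot


def r2_upper_bound(y_obs, noise_sd):
    """Approximate ceiling on R2, assuming additive noise with known SD.

    Assumption for this sketch: the noise is zero-mean, independent of
    the true values, and has the same variance for every sample.
    """
    return 1.0 - noise_sd ** 2 / np.var(y_obs)


def simulate(n=20000, signal_sd=10.0, noise_sd=5.0, seed=0):
    """Compare an oracle model with the noise-based ceiling."""
    rng = np.random.default_rng(seed)
    # Hypothetical noise-free response values (e.g. temperatures).
    y_true = rng.normal(50.0, signal_sd, n)
    # Observed values are the true values plus experimental noise.
    y_obs = y_true + rng.normal(0.0, noise_sd, n)
    # The oracle predicts the noise-free values exactly.
    r2_oracle = r2_score(y_obs, y_true)
    bound = r2_upper_bound(y_obs, noise_sd)
    return r2_oracle, bound


if __name__ == "__main__":
    r2_oracle, bound = simulate()
    print(f"oracle R2: {r2_oracle:.3f}, noise-based ceiling: {bound:.3f}")
```

With these hypothetical settings the signal variance is 100 and the noise variance is 25, so both numbers should land near 1 - 25/125 = 0.8. A model whose test R² is already close to such a ceiling is likely limited by data quality rather than by the choice of features or algorithm.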

uncertainty

Machine learning

genome-scale modelling

deep transfer learning

systems biology

regression

Opponent: Vassily Hatzimanikatis, EPFL, Switzerland

Author

Gang Li

Chalmers, Biology and Biological Engineering, Systems and Synthetic Biology

The pan-genome of Saccharomyces cerevisiae

FEMS Yeast Research, Vol. 19 (2019)

Journal article

Performance of Regression Models as a Function of Experiment Noise

Bioinformatics and Biology Insights, Vol. 15 (2021)

Journal article

Learning deep representations of enzyme thermal adaptation

Protein Science, Vol. 31 (2022)

Journal article

The cell is the fundamental unit of any living organism. A single cell is composed of a massive number of small components, molecules, which are highly organized and frequently interact with each other. The behavior of the cell depends on the interactions among these molecules, much like a society depends on its individuals. Understanding how those interactions give rise to cellular behavior is the major objective of systems biology. Mathematical models are among the most powerful tools to achieve this objective. Such models can be used to understand and predict how a cell responds to external or internal signals and perturbations. However, building mathematical models is not easy, since it requires a deep understanding of each cellular component as well as of the interactions between the different components, and this understanding is usually not available. In this respect, machine learning approaches, which can learn a black-box model directly from data while depending less on biological knowledge, are a very promising complement to the traditional theory- and knowledge-based modelling approaches.

In this thesis, I first explore and discuss the different application scenarios of machine learning in systems biology. Among other applications, I show that i) machine learning can be used to model systems in which the relationships between the components and the system are too complex to be modelled with theory-based models; ii) machine learning can be used to improve existing theory-based models. Secondly, machine learning approaches rely heavily on the quality and volume of data, both of which are limited in most biological datasets. With regard to the quality of data, I evaluate the effect of noise on the development of regression models with both theoretical analysis and simulations. With regard to the volume of data, I showcase how deep transfer learning can be applied to datasets with only a small number of training samples.

The results presented in this thesis show that machine learning is a powerful tool. Its application in systems biology is still developing, with many challenges yet to be solved. In the future, it will become one of the standard tools in the toolbox of every systems biologist.

Predictive and Accelerated Metabolic Engineering Network (PAcMEN)

European Commission (EC) (EC/H2020/722287), 2016-09-01 -- 2020-08-30.

Subject Categories

Biological Sciences

Roots

Basic sciences

Infrastructure

C3SE (Chalmers Centre for Computational Science and Engineering)

ISBN

978-91-7905-290-4

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 4757

Publisher

Chalmers

Online

More information

Latest update

11/9/2023