Application of machine learning in systems biology
In machine learning applications in systems biology, regression models are frequently used, and their performance depends on many factors, including feature engineering and the quality of the response values. Manually engineering a sufficient set of relevant features is particularly challenging in biology because knowledge is lacking in certain areas. With the increasing volume of available data, deep transfer learning makes it possible to learn a statistical summary of the samples from a large dataset, which can then be used as input to train other ML models. In this thesis, I applied this approach to first learn a deep representation of enzyme thermal adaptation and then use it to develop regression models for predicting enzyme optimal temperatures and protein melting temperatures. In both cases, the transfer learning-based regression models outperformed classical models trained on rationally engineered features. Noisy response values, on the other hand, are very common in biological datasets because of variation in experimental measurements, and they fundamentally restrict the performance attainable with regression models. I therefore addressed this challenge by deriving a theoretical upper bound for the coefficient of determination (R²) of regression models. This upper bound depends on the noise associated with the response variable and on the variance of that variable in a given dataset. It can thus be used to test whether the maximal performance has been reached on a particular dataset or whether further model improvement is possible.
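The idea of a noise-limited upper bound on the coefficient of determination can be illustrated with a small simulation. The sketch below is not the derivation from the thesis. It assumes one common form of such a bound, 1 - (mean noise variance) / (variance of the observed response), and it uses hypothetical function names and synthetic, temperature-like values throughout. It compares that assumed bound with the score of an "oracle" model that predicts the noise-free values exactly.

```python
import numpy as np


def r2_score(y_obs, y_pred):
    """Coefficient of determination of predictions against observed values."""
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - np.mean(y_obs)) ** 2)
    return 1.0 - ss_res / ss_tot


def r2_upper_bound(y_obs, noise_var):
    """Assumed form of the bound (not taken from the thesis):
    1 - mean noise variance / variance of the observed response."""
    return 1.0 - np.mean(noise_var) / np.var(y_obs)


# Synthetic data: all numbers here are hypothetical and chosen for illustration.
rng = np.random.default_rng(0)
n = 100_000
sigma_true, sigma_noise = 10.0, 5.0

y_true = rng.normal(60.0, sigma_true, n)          # noise-free response
y_obs = y_true + rng.normal(0.0, sigma_noise, n)  # measured response with noise

bound = r2_upper_bound(y_obs, sigma_noise ** 2)

# An oracle that predicts the noise-free values cannot explain the noise,
# so its score against the noisy observations sits near the assumed bound.
oracle_r2 = r2_score(y_obs, y_true)

print(f"assumed upper bound: {bound:.3f}")
print(f"oracle model R2:     {oracle_r2:.3f}")
```

Under these settings the expected value of both quantities is 1 - 25/125 = 0.8. A model that scores well below such a bound on held-out data may still have room for improvement, whereas a score near the bound suggests that the remaining error is dominated by measurement noise.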
deep transfer learning
Chalmers, Biology and Biological Engineering, Systems Biology
Machine Learning Applied to Predicting Microorganism Growth Temperatures and Enzyme Catalytic Optima
ACS Synthetic Biology, Vol. 8 (2019), p. 1411-1420
Journal article
The pan-genome of Saccharomyces cerevisiae
FEMS Yeast Research, Vol. 19 (2019)
Journal article
Li G, Hu Y, Wang H, Zelezniak A, Ji B, Zrimec J and Nielsen J. Bayesian genome scale modeling identifies thermal determinants of yeast metabolism
Li G, Zrimec J, Ji B, Geng J, Larsbrink J, Zelezniak A, Nielsen J and Engqvist MKM. Performance of regression models as a function of experiment noise
Li G, Zrimec J, Viknander S, Zelezniak A, Nielsen J and Engqvist MKM. Learning deep representations of enzyme thermal adaptation
In this thesis, I first explore and discuss different application scenarios of machine learning in systems biology. Among other applications, I show that i) machine learning can be used to model systems in which the relationships between the components and the system are too complex to be captured by theory-based models; and ii) machine learning can be used to improve existing theory-based models. Second, machine learning approaches rely heavily on the quality and volume of data, both of which are limited in most biological datasets. With regard to data quality, I evaluate the effect of noise on the development of regression models using both theoretical analysis and simulations. With regard to data volume, I show how deep transfer learning can be applied to datasets with only a small number of training samples.
The results presented in this thesis show that machine learning is a powerful tool and that its application in systems biology is still developing, with many challenges yet to be solved. In the future, it will become one of the standard tools in the toolbox of every systems biologist.
Predictive and Accelerated Metabolic Engineering Network (PAcMEN)
European Commission (EU) (EC/H2020/722287), 2016-09-01 -- 2020-08-30.
C3SE (Chalmers Centre for Computational Science and Engineering)
Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 4757
Opponent: Vassily Hatzimanikatis, EPFL, Switzerland