Advancing systems biology of yeast through machine learning and comparative genomics
Doctoral thesis, 2023
In this thesis, machine learning was applied to several unresolved biological problems in yeasts: gene essentiality, enzyme turnover number (kcat), and protein production. In the first part of the work, machine learning approaches were employed to predict gene essentiality from sequence features and evolutionary features. It was demonstrated that essential gene prediction could be substantially improved by integrating evolution-based features. Secondly, a deep learning model, DLKcat, was developed to predict kcat values by combining a graph neural network for substrates with a convolutional neural network for proteins. By predicting kcat profiles for 343 yeast and fungal species, enzyme-constrained models were reconstructed and used to further elucidate cellular metabolism on a large scale. Lastly, a random forest algorithm was used to analyze feature importance for protein production; post-translational modifications (PTMs) were found to have a greater impact on protein production than amino acid composition.
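To illustrate how a predicted kcat value enters an enzyme-constrained model, the minimal sketch below computes the usual capacity bound, flux <= kcat * [enzyme]. The function name, the units, and the example numbers are assumptions made for illustration; they are not taken from the thesis or from DLKcat.

```python
def max_flux(kcat_per_s: float, enzyme_mmol_per_gdw: float) -> float:
    """Upper bound on a reaction flux in mmol/gDW/h.

    Assumed units (illustrative only):
      kcat_per_s           turnover number in 1/s
      enzyme_mmol_per_gdw  enzyme abundance in mmol per gram dry weight
    """
    if kcat_per_s < 0 or enzyme_mmol_per_gdw < 0:
        raise ValueError("kcat and enzyme abundance must be non-negative")
    seconds_per_hour = 3600.0
    # v <= kcat * [E], converted from per-second to per-hour
    return kcat_per_s * seconds_per_hour * enzyme_mmol_per_gdw


# Hypothetical enzyme: kcat = 10 1/s, abundance = 1e-4 mmol/gDW
bound = max_flux(10.0, 1e-4)
print(f"flux upper bound: {bound:.2f} mmol/gDW/h")
```

In a genome-scale model, a bound of this kind would be applied to each enzyme-catalyzed reaction, which is why kcat values are needed for every enzyme.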
In comparative genomics, a comprehensive toolbox, HGTphyloDetect, was developed to facilitate the identification of horizontal gene transfer (HGT) events. Case studies on several yeast species demonstrated the ability of HGTphyloDetect to identify horizontally acquired genes with high accuracy. In addition, through systematic evolutionary analysis (e.g., HGT, gene family expansion) and genome-scale metabolic model simulation, the underlying mechanisms of substrate utilization were further probed across a large number of yeast species.
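The abstract does not say how HGT candidates are scored. As a hedged illustration, the sketch below computes the Alien Index, a heuristic widely used in the HGT literature that compares the best BLAST E-value among ingroup hits with the best among outgroup hits. The function name and the example E-values are hypothetical, and nothing here should be read as the HGTphyloDetect implementation.

```python
import math


def alien_index(best_evalue_ingroup: float, best_evalue_outgroup: float) -> float:
    """Alien Index = ln(E_ingroup + 1e-200) - ln(E_outgroup + 1e-200).

    A large positive value means the gene resembles outgroup (potential
    donor) sequences far more closely than ingroup sequences.
    The 1e-200 offset avoids taking the logarithm of zero.
    """
    eps = 1e-200
    return math.log(best_evalue_ingroup + eps) - math.log(best_evalue_outgroup + eps)


# Hypothetical gene: weak ingroup hit (E = 1.0), strong outgroup hit (E = 1e-50)
score = alien_index(1.0, 1e-50)
print(f"Alien Index: {score:.1f}")
```

A score-based filter like this is typically only a first pass; candidate genes are then checked by phylogenetic analysis.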
enzyme turnover number
gene essentiality
horizontal gene transfer
machine learning
deep learning
yeast species
Author
Le Yuan
Chalmers, Life Sciences, Systems and Synthetic Biology
HGTphyloDetect: facilitating the identification and phylogenetic analysis of horizontal gene transfer
Briefings in Bioinformatics, Vol. In Press (2023)
Journal article
Deep learning-based kcat prediction enables improved enzyme-constrained model reconstruction
Nature Catalysis, Vol. In Press (2022)
Journal article
Improving recombinant protein production by yeast through genome-scale modeling using proteome constraints
Nature Communications, Vol. 13 (2022)
Journal article
Yeast metabolic innovations emerged via expanded metabolic network and gene positive selection
Molecular Systems Biology, Vol. 17 (2021)
Journal article
Machine learning is a state-of-the-art technique that empowers computers to detect patterns and make predictions from large datasets. In this thesis, I used machine learning to predict gene essentiality (i.e., to identify gene deletions that can cause the death of a cell) in yeasts. I also identified biological patterns that can improve such predictions. This is important for the future design of yeast cells and for drug target discovery. Moreover, I developed a deep learning model that predicts how fast and efficiently an enzyme works by looking only at its amino acid sequence and its substrate. The model was applied to more than 300 yeast species to simulate how their metabolism works, by predicting the speed and efficiency of around 3 million enzymes. In addition, I used machine learning to investigate factors that significantly impact protein production in yeasts. This has provided crucial knowledge that can be used in the design of future protein-producing yeasts.
In this thesis, I also used comparative genomics. This is a valuable technique that complements machine learning to investigate complex biological problems. In my research, I developed a toolbox to detect so-called horizontal gene transfer (HGT) events. HGT occurs when a microorganism such as yeast acquires a gene from an external source instead of from its parents. Using this toolbox, it is possible to trace potential transmission routes of such HGT genes. Furthermore, with the aid of various comparative genomic analyses, I systematically explored the underlying mechanisms of substrate usage (i.e., the ability of a microorganism to use a particular substance to carry out its functions) in over 300 yeast species.
Subject Categories
Evolutionary Biology
Bioinformatics (Computational Biology)
Bioinformatics and Systems Biology
Driving Forces
Sustainable development
Roots
Basic sciences
Infrastructure
C3SE (Chalmers Centre for Computational Science and Engineering)
ISBN
978-91-7905-818-0
Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5284
Publisher
Chalmers
Hall KA, Chemistry building, Kemigården 4, Chalmers
Opponent: Prof. Huimin Zhao, University of Illinois at Urbana-Champaign (UIUC), USA