Advancing systems biology of yeast through machine learning and comparative genomics
Doctoral thesis, 2023

Synthetic biology has played a pivotal role in the production of high-value commodities, pharmaceuticals, and bulk chemicals. Fueled by breakthroughs in synthetic biology and metabolic engineering, Saccharomyces cerevisiae and various other yeasts (such as Yarrowia lipolytica and Pichia pastoris) have proven to be promising microbial cell factories and are frequently used in scientific studies. However, the cellular metabolism and physiological properties of most yeast species have not been characterized in detail. To address these knowledge gaps, this thesis leverages the large amounts of data available for yeast species, using state-of-the-art machine learning techniques and comparative genomic analysis to gain deeper insight into yeast traits and metabolism.

In this thesis, machine learning was applied to several unresolved biological problems in yeasts, namely gene essentiality, enzyme turnover number (kcat), and protein production. In the first part of the work, machine learning approaches were employed to predict gene essentiality based on sequence features and evolutionary features. It was demonstrated that essential gene prediction could be substantially improved by integrating evolution-based features. Secondly, a high-quality deep learning model, DLKcat, was developed to predict kcat values by combining a graph neural network for substrates and a convolutional neural network for proteins. By predicting kcat profiles for 343 yeast/fungi species, enzyme-constrained models were reconstructed and used to further elucidate cellular metabolism on a large scale. Lastly, a random forest algorithm was adopted to analyze feature importance for protein production. It was found that post-translational modifications (PTMs) have a relatively higher impact on protein production than amino acid composition.
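To illustrate why kcat values matter when reconstructing enzyme-constrained models, the following minimal sketch assumes the commonly used capacity constraint in which a reaction flux cannot exceed kcat multiplied by the abundance of the catalyzing enzyme. It is not the implementation used in the thesis; the function name `max_flux`, the units, and the numerical values are hypothetical and chosen only for illustration.

```python
# Minimal sketch (assumption): flux upper bound v <= kcat * [E].
# Units are assumed: kcat in 1/s, enzyme abundance in mmol/gDW,
# resulting flux bound in mmol/gDW/h.

SECONDS_PER_HOUR = 3600.0


def max_flux(kcat_per_s: float, enzyme_mmol_per_gdw: float) -> float:
    """Return the upper bound on a reaction flux in mmol/gDW/h."""
    if kcat_per_s < 0 or enzyme_mmol_per_gdw < 0:
        raise ValueError("kcat and enzyme abundance must be non-negative")
    # Convert kcat from per second to per hour, then scale by enzyme abundance.
    return kcat_per_s * SECONDS_PER_HOUR * enzyme_mmol_per_gdw


if __name__ == "__main__":
    # Hypothetical values: kcat = 100 1/s, enzyme abundance = 1e-5 mmol/gDW.
    bound = max_flux(100.0, 1e-5)
    print(f"flux upper bound: {bound:.2f} mmol/gDW/h")
```

Under this assumption, a tenfold error in a predicted kcat shifts the flux bound tenfold as well, which is why prediction quality feeds directly into the quality of the resulting model.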

In comparative genomics, a comprehensive toolbox, HGTphyloDetect, was developed to facilitate the identification of horizontal gene transfer (HGT) events. Case studies on several yeast species demonstrated the ability of HGTphyloDetect to identify horizontally acquired genes with high accuracy. In addition, through systematic evolutionary analysis (e.g., HGT, gene family expansion) and genome-scale metabolic model simulation, the underlying mechanisms of substrate utilization were further probed across a large number of yeast species.

enzyme turnover number

gene essentiality

horizontal gene transfer

machine learning

deep learning

yeast species

Hall KA, Chemistry building, Kemigården 4, Chalmers
Opponent: Prof. Huimin Zhao, University of Illinois at Urbana-Champaign (UIUC), USA

Author

Le Yuan

Chalmers, Life Sciences, Systems Biology

HGTphyloDetect: facilitating the identification and phylogenetic analysis of horizontal gene transfer

Briefings in Bioinformatics, Vol. In Press (2023)

Journal article

Deep learning-based kcat prediction enables improved enzyme-constrained model reconstruction

Nature Catalysis, Vol. In Press (2022)

Journal article

Improving recombinant protein production by yeast through genome-scale modeling using proteome constraints

Nature Communications, Vol. 13 (2022)

Journal article

Yeast metabolic innovations emerged via expanded metabolic network and gene positive selection

Molecular Systems Biology, Vol. 17 (2021)

Journal article

Over the years, synthetic biology has demonstrated significant potential for producing various bulk chemicals, as well as ingredients for cosmetics and pharmaceuticals. To achieve this, microorganisms such as yeasts are commonly used as microbial cell factories. Yeasts are advantageous because they tend to be easy to culture, and many of them can be engineered using genetic toolboxes. Nevertheless, despite their widespread use, many yeasts have not yet been studied in detail, and even for those that have been studied, many gaps remain in the knowledge of their cellular processes. The rapid development of machine learning and comparative genomics techniques can help close these gaps by building on pre-existing data and knowledge.

Machine learning is a state-of-the-art technique that empowers computers to detect patterns and make predictions from large datasets. In this thesis, I used machine learning to predict gene essentiality (i.e., to identify gene deletions that can cause the death of a cell) in yeasts. I also identified biological patterns that can improve such predictions. This is important for the future design of yeast cells and for drug target discovery. Moreover, I developed a deep learning model that predicts how fast and efficiently an enzyme works by looking only at its amino acids. The model was applied to more than 300 yeast species to simulate how their metabolism works, through the prediction of the speed and efficiency of around 3 million enzymes. In addition, I used machine learning to investigate factors that significantly impact protein production in yeasts. This has provided crucial knowledge that can be used in the design of future protein-producing yeasts.

In this thesis, I also used comparative genomics. This is a valuable technique that complements machine learning to investigate complex biological problems. In my research, I developed a toolbox to detect so-called horizontal gene transfer (HGT) events. HGT occurs when a microorganism such as yeast acquires a gene from an external source instead of from its parents. Using this toolbox, it is possible to trace potential transmission routes of such HGT genes. Furthermore, with the aid of various comparative genomic analyses, I systematically explored the underlying mechanisms of substrate usage (i.e., the ability of a microorganism to use a particular substance to carry out its functions) in over 300 yeast species.

Subject categories

Evolutionary Biology

Bioinformatics (Computational Biology)

Bioinformatics and Systems Biology

Driving forces

Sustainable development

Foundations

Basic sciences

Infrastructure

C3SE (Chalmers Centre for Computational Science and Engineering)

ISBN

978-91-7905-818-0

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5284

Publisher

Chalmers

More information

Last updated

2023-05-05