Mapping the proteome with data-driven methods: A cycle of measurement, modeling, hypothesis generation, and engineering
Doctoral thesis, 2021
High-throughput mass spectrometry platforms provide detailed snapshots of a cell's protein content, which can be mined to understand how phenotype arises from genotype and the interplay between the various properties of the constituent proteins. However, these large and dense data are challenging to analyze, and current methods capture only a small fraction of the signal. The first part of my work tackles these issues by implementing a GPU-accelerated and distributed signal decomposition pipeline that makes factorization of large proteomics scans feasible and efficient. The pipeline yields individual analyte signals spanning the majority of the acquired signal, enabling high-precision quantification and further analytical tasks.
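As a rough illustration of what factorizing a scan means, the sketch below decomposes a small synthetic intensity matrix (m/z channels by retention time) into non-negative analyte spectra and elution profiles, using plain NumPy multiplicative updates. It is a toy under stated assumptions, not the GPU-accelerated, distributed pipeline described above: the function name `nmf`, the synthetic data, and the use of matrix rather than tensor factorization are all illustrative choices.

```python
import numpy as np


def nmf(V, rank, n_iter=1000, eps=1e-9, seed=0):
    """Hypothetical helper: non-negative matrix factorization V ~ W @ H
    via multiplicative updates that reduce the squared reconstruction error."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    # Strictly positive random initialization keeps the updates well defined.
    W = rng.random((n_rows, rank)) + eps
    H = rng.random((rank, n_cols)) + eps
    for _ in range(n_iter):
        # Each update rescales the factor elementwise, so it stays non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Synthetic "scan" (assumed data): two co-eluting analytes, each with a fixed
# spectrum over 6 m/z channels and a Gaussian elution profile over 60 time points.
rt = np.linspace(0.0, 1.0, 60)
elution = np.stack([
    np.exp(-((rt - 0.4) / 0.05) ** 2),
    np.exp(-((rt - 0.5) / 0.05) ** 2),
])  # shape (2, 60)
spectra = np.array([
    [1.0, 0.5, 0.0, 0.0, 0.2, 0.0],
    [0.0, 0.0, 1.0, 0.7, 0.2, 0.1],
]).T  # shape (6, 2)
V = spectra @ elution  # observed intensities, shape (6, 60)

W, H = nmf(V, rank=2)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_error:.4f}")
```

Here the columns of `W` play the role of per-analyte spectra and the rows of `H` their elution profiles. Real scans add further dimensions and noise, which is where tensor methods and hardware acceleration become relevant.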
Having such detailed snapshots of the proteome enables a multitude of undertakings. One application has been to use a deep neural network to learn the amino acid sequence determinants of temperature adaptation, in the form of reusable deep model features. More generally, systemic quantities may be predicted from the information encoded in sequence by evolutionary pressure. Two studies inspired by natural language processing sought to learn the grammars behind the languages of expression, in one case predicting mRNA levels from DNA sequence, and in the other predicting protein abundance from amino acid sequence. These two models helped build a quantitative understanding of the central dogma and, in combination, yielded an improved predictor of protein abundance. Finally, a mathematical framework relying on the embedding space of a deep model was constructed to guide the mutation of proteins toward optimized abundance.
Keywords
tensor factorization
mass spectrometry
model interpretation
sequence feature engineering
deep learning
proteomics
machine learning
data-independent acquisition
feature learning
Author
Filip Buric
Chalmers, Biology and Biological Engineering, Systems and Synthetic Biology
Machine learning approaches have proven themselves across many scientific and engineering fields at finding patterns and building complex models from large amounts of data, even without close human guidance. In my thesis, I show how such methods may be used to learn the different "languages" the cell uses to control its content of proteins. These biological machines play most of the functional and structural roles inside a cell and are thus critical to its well-being.
These models rely mostly on DNA and protein sequence information to make predictions about the cell and its components, thus capturing the building instructions encoded in the genome through evolution. By harnessing these languages of molecular assembly and logistics, we may predict various protein properties, such as the temperature at which proteins work best and how abundant they are expected to be. Moreover, the models allow us to speak these languages ourselves, to the extent that we can tweak protein features, opening avenues for medical applications and cellular factories.
Subject Categories (SSIF 2011)
Bioinformatics (Computational Biology)
Bioinformatics and Systems Biology
ISBN
978-91-7905-570-7
Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5037
Publisher
Chalmers
KA-salen, Kemigården 4, Chalmers
Opponent: Prof. Lukas Käll, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH – Royal Institute of Technology, Sweden