Mapping the proteome with data-driven methods: A cycle of measurement, modeling, hypothesis generation, and engineering

Filip Buric

Doctoral thesis, 2021

The living cell exhibits emergent complex behavior, and modeling it requires a systemic, integrative approach if we are to thoroughly understand and harness it. The work in this thesis has had the narrower aim of quantitatively characterizing and mapping the proteome using data-driven methods, as proteins perform most functional and structural roles within the cell. It covers the different parts of the cycle: improving quantification methods, deriving protein features from primary structure, predicting protein content solely from sequence data, and, finally, developing theoretical protein engineering tools, leading back to experiment.
High-throughput mass spectrometry platforms provide detailed snapshots of a cell's protein content, which can be mined to understand how the phenotype arises from the genotype and from the interplay between the various properties of the constituent proteins. However, these large and dense data pose a greater analysis challenge, and current methods capture only a small fraction of the signal. The first part of my work has involved tackling these issues by implementing a GPU-accelerated and distributed signal decomposition pipeline, making factorization of large proteomics scans feasible and efficient. The pipeline yields individual analyte signals spanning the majority of the acquired signal, enabling high-precision quantification and further analytical tasks.
Having such detailed snapshots of the proteome enables a multitude of undertakings. One application has been to use a deep neural network model to learn the amino acid sequence determinants of temperature adaptation, in the form of reusable deep model features. More generally, systemic quantities may be predicted from the information encoded in sequence by evolutionary pressure. Two studies taking inspiration from natural language processing have sought to learn the grammars behind the languages of expression, in one case predicting mRNA levels from DNA sequence, and in the other protein abundance from amino acid sequence. These two models helped build a quantitative understanding of the central dogma and, furthermore, in combination yielded an improved predictor of protein amount. Finally, a mathematical framework relying on the embedded space of a deep model has been constructed to assist guided mutation of proteins towards optimizing their abundance.

Keywords

tensor factorization

mass spectrometry

model interpretation

sequence feature engineering

deep learning

proteomics

machine learning

data-independent acquisition

feature learning

KA-salen, Kemigården 4, Chalmers

Opponent: Prof. Lukas Käll, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH – Royal Institute of Technology, Sweden

Author

Filip Buric

Chalmers, Biology and Biological Engineering, Systems and Synthetic Biology


The complexity of life is awe-inspiring. It is also perplexing when we seek to understand and harness it to treat diseases and live more sustainably. Faced with both the mind-boggling intricacies of even the simplest single-cell organism and the deluge of experimental data, assembling an accurate picture of biology appears daunting and often we only have a handful of hypotheses on which to base our understanding.
However, machine learning approaches have proven themselves across many scientific and engineering fields in finding patterns and building complex models from large amounts of data, even without close human guidance. In my thesis, I show how such methods may be used to learn the different "languages" the cell uses to control its content of proteins. These biological machines play most of the functional and structural roles inside a cell and are thus critical to its well-being.
These models rely mostly on DNA and protein sequence information to make predictions about the cell and its components, thus capturing the building instructions encoded in the genome through evolution. By harnessing these languages of molecular assembly and logistics, we may predict various protein properties, such as the temperature at which proteins work best and how abundant they are expected to be. Moreover, the models allow us to speak these languages ourselves to the extent that we can tweak protein features, opening avenues for medical applications and cellular factories.

Subject Categories (SSIF 2011)

Bioinformatics (Computational Biology)

Bioinformatics and Systems Biology

ISBN

978-91-7905-570-7

Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5037

Publisher

Chalmers