Computational methods for analysis of fragmented sequence data
Doctoral thesis, 2015
Recent developments in genomic and proteomic sequencing technologies have
revolutionized research in life sciences, providing new opportunities for the
study of biological systems. However, modern sequence data sets are large,
diverse, and heavily fragmented, which presents new challenges for their
analysis and interpretation. In this thesis we present six research papers
that describe novel methods for studying bacteria and bacterial communities
through the analysis of large data sets produced by modern DNA and protein
sequencing technologies.
In Paper I, we describe a method for discovering fragments of fluoroquinolone
antibiotic resistance genes in short fragments of DNA. The resistance phenotypes
of the predicted resistance genes were then validated by expression
in an Escherichia coli host (Paper II). The method was further improved to
handle larger and more fragmented data sets in Paper III. In Paper IV, we
present Tentacle, an easy-to-use tool for high-performance gene quantification
in metagenomes that can be run on distributed computing resources, enabling
fast and efficient analysis of terabase-scale metagenomes. In Paper V,
we introduce proteotyping, an approach for microbial identification in clinical
samples based on shotgun proteomics. Finally, in Paper VI we describe and
evaluate a method for proteotyping analysis suited for application to clinical
diagnostics of bacterial infections.
The rapidly increasing volumes of data produced by new sequencing technologies
provide new opportunities for understanding microbial biology. Unlocking
the full potential of large sequence data sets requires novel methods
and approaches such as those presented in this thesis.
bioinformatics
sequencing
distributed computing
proteomics
metagenomics