Analysis of large-scale metagenomic data
Licentiate thesis, 2013

This thesis concerns the analysis of large sets of DNA sequence data produced by modern high-throughput DNA sequencing machines. Using such machines to sequence the genetic content of a microbial community produces a metagenome. The thesis comprises three research papers, all connected to the study of large metagenomic data sets.

In the first paper, we developed a method for discovering fragments of fluoroquinolone antibiotic resistance genes in short DNA sequences. The method uses hidden Markov models to identify qnr genes. Cross-validation showed that the method has high statistical power even for fragments as short as 100 base pairs, a length commonly encountered in modern next-generation sequencing data.

In the second paper, a follow-up study, the putative qnr genes identified in the first paper were verified using wet-lab experiments. An expression system for qnr genes in Escherichia coli hosts was developed and used to evaluate the resistance phenotype of the novel gene candidates.

In the third paper, we developed an easy-to-use, high-performance method for distributed gene quantification in metagenomic sequence data. It leverages high-performance computing resources to provide high throughput while maintaining sensitivity. This enables efficient and accurate gene quantification, suitable for use in comparative metagenomics.

Next-generation DNA sequencing has had a major impact on molecular biology. As the size of the produced data sets increases, so does the need for methods suited to analyzing them. This thesis presents several new methods that are well adapted to the analysis of modern terabase-sized metagenomic data sets.

distributed computing

high-performance computing

hidden Markov models

big data

DNA analysis

metagenomics

antibiotic resistance

Pascal, Matematiska Vetenskaper, Chalmers Tvärgata 3, Chalmers Tekniska Högskola, Göteborg
Opponent: Dr. Sean Hooper, The Institute of Cancer Research, London, England

Author

Fredrik Boulund

University of Gothenburg

Chalmers, Mathematical Sciences, Mathematical Statistics

Driving Forces

Sustainable development

Infrastructure

C3SE (Chalmers Centre for Computational Science and Engineering)

Subject Categories

Bioinformatics (Computational Biology)

Bioinformatics and Systems Biology

Genetics

Areas of Advance

Life Science Engineering (2010-2018)

Preprint - Department of Mathematical Sciences, Chalmers University of Technology and Göteborg University: 2013:17

More information

Created

10/7/2017