Statistical analysis of metagenomic data
Licentiate thesis, 2014

Metagenomics is the study of microbial communities at the genome level by direct sequencing of environmental and clinical samples. Recently developed DNA sequencing technologies have made metagenomics widely applicable, and the field is growing rapidly. The statistical analysis is, however, challenging because of the high variability in the data, which stems from the underlying biological diversity and complexity of microbial communities. Metagenomic data is also high-dimensional, and the number of replicates is typically small. Many standard methods are therefore unsuitable, and new statistical procedures are needed. This thesis contains two papers. In the first paper we evaluate statistical methods for comparative metagenomics. The ability to detect differentially abundant genes and to control error rates is evaluated for eleven methods previously used in metagenomics. Resampled data from a large metagenomic data set is used to provide an unbiased basis for comparisons between methods. The number of replicates, the effect size and the gene abundance are all shown to have a large impact on performance. The statistical characteristics of the evaluated methods can serve as a guide for the statistical analysis in future metagenomic studies. The second paper describes a new statistical method for the analysis of metagenomic data. The underlying model is formulated within the framework of a hierarchical Bayesian generalized linear model. A joint prior is placed on the variance parameters and shared between all genes. We evaluate the model and show that it improves the ability to detect differentially abundant genes. This thesis underlines the importance of sound statistical analysis when the data is noisy and high-dimensional. It also demonstrates the potential of statistical modeling within metagenomics.

Environmental genomics

Statistical power

False discovery rate

Generalized linear models

Count data

Metagenomics

Hierarchical Bayesian models

Statistical methods

Pascal, Mathematical Sciences, Chalmers tvärgata 3
Opponent: Dr. Ingrid Lönnstedt, Statistikon AB / Walter and Eliza Hall Institute, Melbourne, Australia

Author

Viktor Jonsson

University of Gothenburg

Chalmers, Mathematical Sciences

Subject categories

Bioinformatics and systems biology

Probability theory and statistics

Preprint / Department of Mathematical Sciences, Chalmers University of Technology and Göteborg University: 2014:22
