Modelling of zero-inflation improves inference of metagenomic gene count data
Journal article, 2019

Metagenomics enables the study of gene abundances in complex mixtures of microorganisms and has become a standard methodology for the analysis of the human microbiome. However, gene abundance data is inherently noisy and contains high levels of biological and technical variability as well as an excess of zeros due to non-detected genes. This makes the statistical analysis challenging. In this study, we present a new hierarchical Bayesian model for inference of metagenomic gene abundance data. The model uses a zero-inflated overdispersed Poisson distribution which is able to simultaneously capture the high gene-specific variability as well as zero observations in the data. By analysis of three comprehensive datasets, we show that zero-inflation is common in metagenomic data from the human gut and, if not correctly modelled, it can lead to substantial reductions in statistical power. We also show, by using resampled metagenomic data, that our model has, compared to other methods, a higher and more stable performance for detecting differentially abundant genes. We conclude that proper modelling of the gene-specific variability, including the excess of zeros, is necessary to accurately describe gene abundances in metagenomic data. The proposed model will thus pave the way for new biological insights into the structure of microbial communities.
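As a rough illustration of what a zero-inflated overdispersed Poisson distribution looks like, the sketch below mixes a point mass at zero with a negative binomial (gamma-Poisson) count distribution. This parametrization is an assumption made for illustration: the abstract does not specify the exact form used in the paper, and the names `nb_pmf`, `zinb_pmf`, `mu`, `size` and `pi_zero` are hypothetical.

```python
import math

def nb_pmf(k, mu, size):
    """Negative binomial pmf with mean mu > 0 and dispersion parameter size > 0.

    Assumption: overdispersion is modelled as a gamma-Poisson mixture,
    giving variance mu + mu**2 / size.
    """
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def zinb_pmf(k, mu, size, pi_zero):
    """Zero-inflated pmf: with probability pi_zero the count is a structural
    zero (e.g. a non-detected gene), otherwise it follows nb_pmf."""
    base = (1.0 - pi_zero) * nb_pmf(k, mu, size)
    return pi_zero + base if k == 0 else base

# Zero-inflation raises the probability of observing a zero count
# relative to the plain overdispersed model with the same mu and size.
p_zero_plain = nb_pmf(0, mu=5.0, size=2.0)
p_zero_inflated = zinb_pmf(0, mu=5.0, size=2.0, pi_zero=0.3)
```

Under this sketch the expected count shrinks to `(1 - pi_zero) * mu`, which hints at why ignoring excess zeros can distort estimates of gene abundance.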

Metagenomics

MCMC

environmental sequencing

zero-inflation

Markov chain Monte Carlo

Bayesian modeling

generalized linear models

human microbiome

Authors

Viktor Jonsson

Chalmers, Mathematical Sciences

CSBI

Tobias Österlund

Chalmers, Mathematical Sciences, Applied Mathematics and Statistics

Olle Nerman

Chalmers, Mathematical Sciences, Applied Mathematics and Statistics

Erik Kristiansson

Chalmers, Mathematical Sciences, Applied Mathematics and Statistics

Statistical Methods in Medical Research

0962-2802 (ISSN)

Vol. 28, Issue 12, pp. 3712-3728

Subject Categories

Bioinformatics (Computational Biology)

Probability Theory and Statistics

Areas of Advance

Life Science Engineering (2010-2018)

DOI

10.1177/0962280218811354

PubMed

30474490

More information

Latest update

2019-09-19