Statistical assessment of somatic mutations and genomic variability using DNA sequence data
The development of new DNA sequencing techniques has made it possible to generate high-resolution genomic data at an unprecedented pace. However, the high dimensionality, in combination with substantial levels of technical error and biological variability, makes the analysis challenging. Tailored statistical methods therefore need to be developed and applied in order to facilitate correct biological interpretation. The first two papers in this thesis focus on finding tumor-specific (somatic) mutations in cancer, while the third paper develops a new method for assessing genomic variability in microbial communities.
In paper I, the aim was to characterize somatic mutations in pheochromocytoma/paraganglioma, and to identify mutations that contribute to malignancy.
Statistical analysis of exome sequencing data from nine replicated paired normal–tumor samples revealed 225 unique somatic mutations. A significantly higher rate of mutations was found in malignant compared to benign tumors. In addition, three genes with recurrent somatic mutations, exclusively located in malignant tumors, were identified.
In paper II, exome sequencing data was used to detect somatic mutations in 17 patients with acute myeloid leukemia. The identified mutations were evaluated as markers in a more sensitive analysis of remaining cancer cell levels after treatment. All but one of the studied patients were found to have potential markers in their somatic mutation profiles.
In paper III, a hierarchical Bayesian model for detecting genetic differences at the nucleotide level between groups of microbial communities is proposed. The model is based on a Dirichlet-multinomial distribution and takes both within- and between-sample variability into account. The performance evaluation shows that the model has high sensitivity and maintains a low false positive rate even when the between-sample variability is high.
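As a minimal illustration of the distribution underlying paper III, and not the thesis's actual model or implementation, the Dirichlet-multinomial log-probability of the nucleotide counts observed at a single position can be sketched as follows. The function name, the ordering of the counts as (A, C, G, T), and the example values are assumptions made for this sketch only.

```python
from math import lgamma, exp


def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of a count vector under a Dirichlet-multinomial.

    counts: non-negative integer counts, e.g. reads supporting A, C, G, T
            at one position in one sample (hypothetical layout).
    alpha:  positive Dirichlet parameters of the same length. A small
            sum(alpha) corresponds to high between-sample variability,
            a large sum(alpha) approaches a plain multinomial.
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)
    a0 = sum(alpha)
    # Multinomial coefficient n! / prod(x_k!)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Ratio of gamma functions from integrating out the proportions
    log_ratio = lgamma(a0) - lgamma(n + a0)
    log_terms = sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alpha))
    return log_coef + log_ratio + log_terms


# Hypothetical example: 30 reads at one position, mostly supporting "A".
example_counts = [27, 1, 1, 1]
low_variability = [90.0, 3.0, 3.0, 3.0]   # proportions tightly concentrated
high_variability = [0.9, 0.03, 0.03, 0.03]  # same mean, much more dispersion

print(exp(dirichlet_multinomial_logpmf(example_counts, low_variability)))
print(exp(dirichlet_multinomial_logpmf(example_counts, high_variability)))
```

Both parameter vectors have the same expected nucleotide proportions; only their sum differs, which is how this distribution separates sampling noise within a sample from variability between samples.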
The thesis demonstrates the importance of dedicated statistical analysis, and of understanding the error structure in DNA sequence data, to ensure accurate identification of mutations and differences in genomic variability.
calling of somatic mutations
hierarchical Bayesian model
DNA sequence data