Statistical analysis of gene expression data
Doctoral thesis, 2007
Microarray technology has become one of the most important tools for genome-wide mRNA
measurements. The technique has been successfully applied to many areas in modern
biology including cancer research, identification of drug targets, and categorization of genes
involved in the cell cycle. Nevertheless, the analysis of microarray data is difficult due to
the vast dimensionality and the high levels of noise. The need for solid statistical methods
is therefore strong.
The main results are presented in six papers. The first three develop a statistical model
for quality assessment and improved gene ranking called Weighted Analysis of Microarray
Experiments (WAME). Here, the customary assumption of independent samples is shown
to be invalid, and the model instead introduces an individual variance for each array and
a correlation for each pair of arrays. Comparisons to other common methods suggest that the proposed model
produces more accurate results. The first paper describes the model for simple experimental
setups for two-channel arrays. This model is then generalized to more complex designs in
paper two and to one-channel microarrays in paper three.
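The effect of array-specific variances and between-array correlations on how replicate arrays are combined can be illustrated with a minimal sketch. It is not the WAME implementation: it only computes standard generalized least squares weights for a common mean, and it assumes the covariance matrix between arrays is already known, whereas the thesis estimates it from the data. The function name `gls_weights` and all numerical values are hypothetical.

```python
import numpy as np

def gls_weights(sigma):
    """Weights of the minimum-variance unbiased estimate of a common mean,
    w = Sigma^{-1} 1 / (1' Sigma^{-1} 1), for a known covariance matrix."""
    ones = np.ones(sigma.shape[0])
    solved = np.linalg.solve(sigma, ones)  # Sigma^{-1} 1 without explicit inversion
    return solved / (ones @ solved)

# Hypothetical covariance: three uncorrelated arrays, the third one noisier.
sigma_indep = np.diag([1.0, 1.0, 4.0])
weights_indep = gls_weights(sigma_indep)

# Hypothetical covariance: equal variances, but arrays 1 and 2 are correlated.
sigma_corr = np.array([[1.0, 0.8, 0.0],
                       [0.8, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
weights_corr = gls_weights(sigma_corr)

# Made-up log-ratios for two genes (rows) on three arrays (columns).
log_ratios = np.array([[1.0, 1.2, 0.2],
                       [-0.5, -0.4, 0.3]])
combined = log_ratios @ weights_corr  # one weighted mean log-ratio per gene
```

In the first case the noisy array is down-weighted in proportion to its inverse variance. In the second, the two correlated arrays carry partly redundant information, so each receives less weight than the independent third array.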
Transcription factors govern gene expression in the cell by binding to short sequences
called cis-regulatory elements. These sequences are located in the promoters, which are
regions of DNA upstream of the genes. In paper four, we show that the lengths of these
promoters are related to gene function. In particular, the promoters of stress-responsive
genes are generally longer than those of other genes. This observation is used in a novel method for
identifying relevant cis-regulatory elements from a list of differentially expressed genes.
Papers five and six present microarray-based studies from molecular biology and environmental
toxicology, respectively. In paper five, microarrays are used to identify Saccharomyces
cerevisiae genes with changed mRNA levels under arsenic stress. In paper
six, biomarkers for estrogen exposure in fish are found using both an in-house microarray
experiment and a meta-analysis of several public gene expression datasets.
Keywords:
gene expression
linear models
categorical data analysis
heavy metal stress
logistic regression
DNA microarrays
empirical Bayes
ecotoxicology
quality control
gene regulation