Network models with applications to genomic data: generalization, validation and uncertainty assessment
Doctoral thesis, 2014
The aim of this thesis is to provide a framework for the estimation and analysis of transcription networks in human cancer. The methods we develop are applied to data collected by The Cancer Genome Atlas (TCGA), and the supporting simulations are based on models derived from these data so that they reflect the structure of real data. Nevertheless, our proposed models apply to network construction for any data type. The thesis includes four papers, each addressing a different aspect of network estimation.
Statistical analysis of high-dimensional data requires regularization. Network model validation amounts to the selection of regularization parameters that control sparsity and, possibly, some common structure across different data classes (here, types of cancer). In paper I we present a bootstrap-based method to perform sparsity selection and robust network construction. We show, by simulation studies, that our proposed methods select sparsity so as to control the false positive rate, rather than to match the size of the true underlying network.
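As an illustration of the general idea behind bootstrap-based network construction, the following is a minimal sketch and not the method of paper I. It assumes a deliberately simple stand-in estimator (thresholded absolute Pearson correlation in place of a regularized network model), and all function names, the `threshold` parameter (playing the role of the sparsity parameter) and the `min_freq` cutoff are hypothetical. Edges are kept only if they reappear in a large fraction of bootstrap resamples.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def estimate_edges(rows, threshold):
    """Stand-in network estimator (an assumption, not the thesis method):
    connect variables i < j when |correlation| >= threshold."""
    p = len(rows[0])
    cols = [[r[j] for r in rows] for j in range(p)]
    return {
        (i, j)
        for i, j in combinations(range(p), 2)
        if abs(pearson(cols[i], cols[j])) >= threshold
    }


def bootstrap_edge_frequencies(rows, threshold, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which each edge is selected."""
    rng = random.Random(seed)
    n = len(rows)
    counts = {}
    for _ in range(n_boot):
        # Resample observations (rows) with replacement.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        for edge in estimate_edges(sample, threshold):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n_boot for edge, c in counts.items()}


def robust_network(rows, threshold, min_freq=0.9, n_boot=200, seed=0):
    """Keep only edges selected in at least min_freq of the resamples."""
    freqs = bootstrap_edge_frequencies(rows, threshold, n_boot, seed)
    return {edge for edge, f in freqs.items() if f >= min_freq}


def make_toy_data(n=60, seed=1):
    """Toy data: variables 0 and 1 are dependent, variable 2 is independent."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x0 = rng.gauss(0, 1)
        x1 = x0 + 0.3 * rng.gauss(0, 1)
        x2 = rng.gauss(0, 1)
        rows.append((x0, x1, x2))
    return rows
```

Calling `robust_network(make_toy_data(), threshold=0.5)` returns the set of edges that are stable across resamples. Raising `min_freq` makes the retained network more conservative, which is in the spirit of controlling false positives rather than recovering every true edge.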
In paper II we address the problem of uncertainty in network estimation. Because network estimates are highly unstable, assessing their uncertainty is important in order to avoid overinterpretation of results. Using ideas from information theory, we introduce a method that assesses uncertainty by presenting a set of candidate network estimates, rather than a single network model. The method enables us to show that different network topologies have different estimation properties, and that the performance of each network estimation method depends on the topology.
It is often of interest to identify and study the commonalities and differences in network estimates across several classes (here, types of cancer) and data types. Statistical network models, like the graphical lasso, provide a framework in which several classes and data types can be integrated. Paper III makes use of such a framework and presents a method that allows for large-scale sparse inverse covariance estimation for several classes. Through the application of priors, we account for plausible connections across different data types. The proposed method also encourages the expected modular structure of biological networks and corrects for unbalanced sample sizes across classes. The estimated networks are part of a publicly accessible resource termed Cancer Landscapes (cancerlandscapes.org), which provides a setting for interactive analysis in relation to pathway and pharmacological databases, diagnoses, survival associations and drug targets.
Traditionally, the analysis of genomic data has focused on the study of differential expression. In paper IV we propose a way to integrate differential expression analysis with network estimation. To that end, we extend existing methods in order to jointly estimate sparse mean vectors and precision matrices across several classes, thus improving on analyses that focus on only one or the other. Additionally, by assuming a block diagonal structure in the precision matrices, the problem can be recast as an ensemble classifier in which each block becomes part of either a linear or a quadratic discriminant function.
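To see why a block diagonal precision matrix leads to a sum of per-block discriminant functions, consider the standard Gaussian discriminant score. The notation here is assumed for illustration and is not taken from the thesis: $k$ indexes classes, $b = 1, \dots, B$ indexes blocks, $\pi_k$ is the class prior, and $\mu_k^{(b)}$ and $\Omega_k^{(b)}$ are the mean vector and precision matrix of block $b$ in class $k$.

```latex
\delta_k(x) = \log \pi_k
  + \sum_{b=1}^{B} \left[ \tfrac{1}{2} \log\det \Omega_k^{(b)}
  - \tfrac{1}{2} \bigl(x^{(b)} - \mu_k^{(b)}\bigr)^{\top} \Omega_k^{(b)} \bigl(x^{(b)} - \mu_k^{(b)}\bigr) \right]

% If a block's precision matrix is shared across classes,
% \Omega_k^{(b)} = \Omega^{(b)} for all k, the terms that do not depend on k
% can be dropped and the block's contribution becomes linear in x^{(b)}:
\delta_k^{(b)}(x) = \bigl(x^{(b)}\bigr)^{\top} \Omega^{(b)} \mu_k^{(b)}
  - \tfrac{1}{2} \bigl(\mu_k^{(b)}\bigr)^{\top} \Omega^{(b)} \mu_k^{(b)}
```

A block whose precision matrix differs between classes keeps its quadratic term, so under these assumptions the overall classifier sums linear contributions from shared blocks and quadratic contributions from class-specific blocks.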
networks
inverse covariance matrix
TCGA pan cancer analysis
online resource
low-sample
cancer
discriminant analysis
precision matrix
sparsity
graphical models
elastic net
classification
fused lasso
high-dimension