Life originated in the sea, and the species diversity that we observe today has its roots in marine biology. Besides possessing the most diverse and unique genomes among all living things, marine organisms serve as illustrative indicators of climate change. This project proposal focuses on the statistical treatment of biological sequence data in genomic pipelines. The proposal builds on a larger project to establish an infrastructure for marine genomics research, in which our role is the development and management of a bioinformatics pipeline for high-throughput genome sequence analysis. The statistical modeling of biological sequence motifs, or sequence words, runs as a common thread through such a pipeline: in sequence assembly, it is used to separate sequences of different origin; in genome characterization, to identify functional elements in the sequence; in comparative analyses, to characterize the evolutionary relationships between species; in functional analyses, to compare the gene predictions to known gene and protein families; and in gene expression analysis, to compute the relative abundance of various transcripts under various conditions. All these analyses rest upon the robust characterization and statistical modeling of biological sequence "words". Such word models will then be used for clustering genomic signatures, hypothesis testing between different gene sets, and comparative gene finding over large evolutionary distances.
at Mathematical Sciences, Mathematical Statistics
Funding years 2012–2015
Area of Advance