Next-generation sequencing (NGS) has revolutionized molecular biology. The bottleneck is the availability of computational tools to assign meaning to all this data. We propose a comparative gene finding tool that integrates NGS, in particular RNA-Seq, with computational gene finding. RNA-Seq enables us to identify all gene products in a sample without the need for a reference genome or database, including all alternative transcripts and splice forms as well as their relative expression levels. However, there are still issues with low coverage, rarely expressed transcripts, and sequencing errors that make the assignment of RNA-Seq transcripts to their genome locations unreliable. Computational gene finding methods not only identify regions that are likely to harbour genes, but also construct complete gene models that adhere to the rules imposed by the transcription machinery. Combining the strengths of RNA-Seq data and computational gene finding will greatly improve the use of such data. Our model is based on Conditional Random Fields (CRFs), the discriminative counterpart of Hidden Markov Models (HMMs). CRFs avoid an inherent limitation of HMMs, which impose numerous, fairly unrealistic independence assumptions on the model features. CRFs are flexible both in handling complex dependencies and in incorporating a variety of model features, which makes them unusually well suited for the purpose.
at Mathematical Sciences, Mathematical Statistics
Funding years 2011–2013