In this paper, we present a new system, called GeneScout, for predicting gene structures in vertebrate genomic DNA. The system contains specially designed hidden Markov models (HMMs) for detecting functional sites, including protein-translation start sites and mRNA splicing junction donor and acceptor sites. Our main hypothesis is that, given a vertebrate genomic DNA sequence S, it is always possible to construct a directed acyclic graph G such that the path for the actual coding region of S is in the set of all paths on G. Thus, the gene detection problem is reduced to that of analyzing the paths in the graph G. A dynamic programming algorithm is used to find the optimal path in G. The proposed system is trained using an expectation-maximization (EM) algorithm, and its performance on vertebrate gene prediction is evaluated using the 10-way cross-validation method. Experimental results show the good performance of the proposed system and its complementarity to a widely used...
Michael M. Yin, Jason Tsong-Li Wang
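
The abstract reduces gene detection to finding an optimal path in a directed acyclic graph G by dynamic programming. The sketch below illustrates that general technique only; it is not GeneScout's actual algorithm. The function name `best_path`, the node labels, and the edge scores are all hypothetical, and the sketch assumes that each edge carries an additive score (for example, a log-odds value produced by a site or region model) and that the optimal path is the one with the highest total score.

```python
from collections import defaultdict


def best_path(edges, source, sink):
    """Return (score, path) for the highest-scoring source-to-sink path in a DAG.

    edges: iterable of (u, v, score) tuples. Scores are hypothetical additive
    values; how GeneScout scores its graph is not stated in the abstract.
    """
    adj = defaultdict(list)
    indeg = defaultdict(int)
    nodes = {source, sink}
    for u, v, w in edges:
        adj[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))

    # Topological order (Kahn's algorithm) so every node is finalized
    # before any of its successors is relaxed.
    order = []
    ready = [n for n in nodes if indeg[n] == 0]
    while ready:
        u = ready.pop()
        order.append(u)
        for v, _ in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")

    # Dynamic programming: best[v] is the top score of any source-to-v path.
    best = {n: float("-inf") for n in nodes}
    prev = {}
    best[source] = 0.0
    for u in order:
        if best[u] == float("-inf"):
            continue  # u is not reachable from the source
        for v, w in adj[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                prev[v] = u

    if best[sink] == float("-inf"):
        return float("-inf"), []

    # Trace back from the sink to recover the path itself.
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return best[sink], path[::-1]


# Hypothetical toy graph: nodes stand in for candidate functional sites.
example_edges = [
    ("start", "donor1", 2.0),
    ("start", "donor2", 1.0),
    ("donor1", "acceptor1", 1.5),
    ("donor2", "acceptor1", 3.0),
    ("acceptor1", "stop", 1.0),
    ("donor1", "stop", 0.5),
]
score, path = best_path(example_edges, "start", "stop")
print(score, path)
```

Because each node is processed once in topological order and each edge is relaxed once, the running time is linear in the number of nodes and edges, which is what makes searching all paths in G tractable.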