We present an efficient, fully automated algorithm for assembling ESTs into full-length cDNA sequences, each representing the complete coding region of a gene. Our EST clustering algorithm is neither hierarchical nor incremental but recursive, and it processes each EST once. The algorithm exploits a variety of syntactic and statistical features of the ESTs. The resulting assembly shows significant improvements in computational efficiency and information extraction over a previous assembly of C. reinhardtii ESTs. The algorithm was developed on C. reinhardtii using iterative and participatory design; however, it can be applied to any organism with a draft genomic sequence.
Arthur Grossman, Charles Hauser, Hilary J. Holz, J