A method is described for identifying and classifying proteins encoded in large DNA sequences. Previously, an automated system was introduced for the general detection of amino acid sequence motifs within diverse protein families. That system generated a database of aligned sequence segments (blocks) corresponding to the most highly conserved regions of proteins. The database of blocks can be searched with protein queries for sensitive detection of homology, based on both local and global similarities. Here we show that the same database-searching approach can also detect distant relatives encoded in very large DNA sequences. The approach is illustrated by the detection of known and new relationships in the 315-kilobase (kb) sequence of yeast chromosome III.
Steven Henikoff, Jorja G. Henikoff
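The idea of searching a DNA sequence against a database of protein blocks can be illustrated with a minimal sketch. It assumes the DNA query is translated in all six reading frames (a step the abstract does not spell out) using the standard genetic code, and that each block is reduced to a simple log-odds profile with pseudocounts. The scoring scheme and the names `six_frame_translate`, `block_profile`, and `best_hit` are hypothetical, chosen for illustration only, and are not the authors' implementation.

```python
import math
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna):
    """Map frame labels +1..+3 and -1..-3 to translated protein strings."""
    dna = dna.upper()
    frames = {}
    for sign, strand in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            frames[sign * (offset + 1)] = translate(strand[offset:])
    return frames


def block_profile(segments, pseudocount=1.0):
    """Build per-column log-odds scores from equal-length aligned segments.

    Assumption: a uniform background of 1/20 per amino acid, which is a
    simplification made for this sketch.
    """
    profile = []
    for col in range(len(segments[0])):
        counts = {aa: pseudocount for aa in ALPHABET}
        for seg in segments:
            if seg[col] in counts:
                counts[seg[col]] += 1
        total = sum(counts.values())
        profile.append(
            {aa: math.log((n / total) * 20) for aa, n in counts.items()}
        )
    return profile


def best_hit(dna, segments):
    """Return (score, frame, start, window) for the best-scoring window.

    Stop codons ('*') and unknown residues ('X') contribute a score of 0.
    """
    profile = block_profile(segments)
    width = len(profile)
    best = None
    for frame, protein in six_frame_translate(dna).items():
        for start in range(len(protein) - width + 1):
            window = protein[start:start + width]
            score = sum(col.get(aa, 0.0) for col, aa in zip(profile, window))
            if best is None or score > best[0]:
                best = (score, frame, start, window)
    return best
```

Calling `best_hit(dna, segments)` with a DNA string and the aligned segments of one block returns the highest-scoring window together with its reading frame, so a match on the reverse strand is reported with a negative frame label. A real search would repeat this over every block in the database and assess the significance of each score, which this sketch does not attempt.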