Discrete motifs that discriminate functional classes of proteins are useful for classifying new sequences, capturing structural constraints, and identifying protein subclasses. Despite the fact that the space of such motifs can grow exponentially with sequence length and number, we show that in practice it usually does not, and we describe a technique that infers motifs from aligned protein sequences by exhaustively searching this space. Our method generates sequence motifs over a wide range of recall and precision, and chooses a representative motif based on a score that we derive from both statistical and information-theoretic frameworks. Finally, we show that the selected motifs perform well in practice, classifying unseen sequences with extremely high precision, and infer protein subclasses that correspond to known biochemical classes.
Craig G. Nevill-Manning, Komal S. Sethi, Thomas D.