Sequence signatures, such as motifs and domains, are typically assumed to be localized in a single region of a sequence, or are derived as combinations of such localized signatures. We generalize the concept of sequence signatures and introduce an algorithm that efficiently determines signatures based on subsequences that may be located anywhere in a sequence. In a preprocessing step, sequences are transformed into a feature space, which is then mined for generalized signatures. We evaluate our signatures against those in the InterPro database and highlight the differences between the two. A second comparison with InterPro shows that our signatures can be used to derive sequence annotations with higher confidence.
Dietmar H. Dorr, Anne Denton
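The two-stage idea in the abstract (map each sequence into a position-independent feature space, then mine feature combinations shared by many sequences) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the feature space is taken to be the set of distinct k-mers in a sequence, the mining step is a simple Apriori-style support count, and the names `kmer_features`, `mine_signatures`, `k`, `size`, and `min_support` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def kmer_features(seq, k=3):
    """Map a sequence to the set of distinct k-mers it contains.

    Positions are discarded, so a feature may come from anywhere in the
    sequence. (Assumed feature space, chosen for illustration only.)
    """
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mine_signatures(sequences, k=3, size=2, min_support=2):
    """Return k-mer combinations of the given size that co-occur in at
    least min_support sequences, mapped to their support counts.

    Apriori-style pruning: a combination can only be frequent if each of
    its member k-mers is frequent on its own.
    """
    features = [kmer_features(s, k) for s in sequences]

    # Support of individual k-mers across sequences.
    single = Counter(f for fs in features for f in fs)
    frequent = {f for f, c in single.items() if c >= min_support}

    # Count combinations built only from individually frequent k-mers.
    counts = Counter()
    for fs in features:
        for combo in combinations(sorted(fs & frequent), size):
            counts[combo] += 1

    return {combo: c for combo, c in counts.items() if c >= min_support}


# Toy input: "ACD" and "HIK" occur in three sequences, at different
# positions and in different orders.
seqs = ["ACDEFGHIK", "ACDXXXHIK", "HIKYYACD", "MMMMMMMM"]
signatures = mine_signatures(seqs, k=3, size=2, min_support=3)
# signatures == {("ACD", "HIK"): 3}
```

In this toy input the pair ("ACD", "HIK") is reported even though the two subsequences are adjacent to neither each other nor a fixed position, which is the kind of non-localized signature the abstract describes. A realistic implementation would need a more scalable mining strategy than exhaustive combination counting.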