Identification of regulatory signals in DNA depends on the nature and quality of the patterns derived from representative sequences. These patterns are constructed from training sets of sequences by means of probabilistic models that either assume independence between positions or suffer from considerable computational complexity. We have developed and tested higher-order models that account for significantly dependent position pairs or triads, thereby capturing position-dependent information hidden in DNA binding sites. We have evaluated our algorithm on several data sets, including eukaryotic and bacterial transcription factor binding sites, and have shown that the scores from the higher-order representation of binding sites correlate significantly and positively with binding affinity scores.
Hossein Zare, Mostafa Kaveh, Arkady B. Khodursky
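
To illustrate what a dependent position pair means in a set of aligned binding sites, the following is a minimal sketch, not the authors' algorithm. It assumes that dependence between two positions is measured by mutual information and that a pair counts as dependent when that value exceeds a fixed threshold; the abstract does not state which statistic or significance test is used. The function names, the threshold, and the toy alignment are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(sites, i, j):
    """Mutual information (in bits) between positions i and j of aligned sites."""
    n = len(sites)
    freq_i = Counter(site[i] for site in sites)
    freq_j = Counter(site[j] for site in sites)
    freq_ij = Counter((site[i], site[j]) for site in sites)
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        # Compare the observed joint frequency with the product of the
        # marginals, which is what an independence model would predict.
        mi += p_ab * math.log2(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi


def dependent_pairs(sites, threshold):
    """Position pairs whose mutual information exceeds an assumed threshold."""
    length = len(sites[0])
    return [
        (i, j)
        for i, j in combinations(range(length), 2)
        if mutual_information(sites, i, j) > threshold
    ]


# Hypothetical toy alignment: positions 0 and 1 always co-vary,
# position 2 is constant and carries no pairwise information.
sites = ["ATG", "CGG", "ATG", "CGG"]
print(dependent_pairs(sites, threshold=0.5))
```

On real training sets a raw threshold would be too crude: small samples inflate mutual information, so a significance test and pseudocounts would be needed before treating a pair as dependent.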