Many of the modeling methods used for natural languages, specifically Markov models and HMMs, have also been applied to biological sequence analysis. In recent years, natural language models have been improved by maximum entropy methods, which allow information from the entire history of a sequence to be considered. This is in contrast to Markov models, the standard for most biological sequence models, whose predictions are generally based on some fixed number of previous emissions. To test the utility of maximum entropy modeling for biological sequence analysis, we used these methods to model amino acid sequences. Our results show that there is significant long-distance information in amino acid sequences and suggest that maximum entropy techniques may be beneficial for a range of biological sequence analysis problems.

Keywords: maximum entropy, amino acids, sequence analysis
Eugen C. Buehler, Lyle H. Ungar