Several computer algorithms for discovering patterns in groups of protein sequences are in use that are based on fitting the parameters of a statistical model to a group of related sequences. These include hidden Markov model (HMM) algorithms for multiple sequence alignment, and the MEME and Gibbs sampler algorithms for discovering motifs. These algorithms are sometimes prone to producing models that are incorrect because two or more patterns have been combined. The statistical model produced in this situation is a convex combination (weighted average) of two or more different models. This paper presents a solution to the problem of convex combinations in the form of a heuristic based on using extremely low variance Dirichlet mixture priors as part of the statistical model. This heuristic, which we call the megaprior heuristic, increases the strength (i.e., decreases the variance) of the prior in proportion to the size of the sequence dataset. This causes each column in the final model to strongly resemble the mean o...
Timothy L. Bailey, Michael Gribskov
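
The effect of scaling the prior strength with dataset size can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single Dirichlet component rather than a mixture, a four-letter alphabet, and a hypothetical proportionality constant `scale`. The function names `posterior_mean` and `megaprior_strength` are also illustrative. The column counts describe a convex combination, with half the sequences showing one letter and half another.

```python
def posterior_mean(counts, prior_mean, strength):
    """Posterior mean letter frequencies for one model column.

    The Dirichlet prior has pseudocounts strength * prior_mean, so a
    larger strength means a lower-variance prior.
    """
    total = sum(counts) + strength
    return [(c + strength * m) / total for c, m in zip(counts, prior_mean)]


def megaprior_strength(n_sequences, scale=10.0):
    """Prior strength proportional to dataset size (scale is an assumed constant)."""
    return scale * n_sequences


# A column where two patterns have been mixed: 30 sequences show the
# first letter and 30 show the second.
counts = [30, 30, 0, 0]
n_sequences = sum(counts)

# Assumed prior component mean that favours the first letter.
prior_mean = [0.85, 0.05, 0.05, 0.05]

# Fixed weak prior: the estimate stays close to the 50/50 mixture.
weak = posterior_mean(counts, prior_mean, strength=1.0)

# Prior strength scaled with dataset size: the estimate is pulled
# toward the prior mean instead of the mixture.
strong = posterior_mean(counts, prior_mean, megaprior_strength(n_sequences))

print("weak prior:  ", [round(p, 3) for p in weak])
print("strong prior:", [round(p, 3) for p in strong])
```

With a fixed weak prior the data dominate and the column keeps the mixed distribution; with strength proportional to the number of sequences the prior's influence does not fade as the dataset grows, so the column stays close to the prior mean.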