An important goal in bioinformatics is determining the homology and function of proteins from their sequences. Pairwise sequence similarity algorithms are often employed for this purpose. This paper describes a method for improving the accuracy of such algorithms using knowledge about families of proteins. The method requires a library of protein families against which to compare query sequences. A standard pairwise similarity search algorithm is used to search the library with the query, and a new variant of the Family Pairwise Search (FPS) algorithm converts the results into a list sorted by the E-values of the matches between the query and the families. The E-value of each query-family match is calculated using a statistical distribution introduced here that describes the behavior of the product of the p-values of correlated random variables. We also describe an algorithm (ESIZE) for estimating the single parameter of this distribution. This parameter summarizes the amount of corre...
Timothy L. Bailey, William Noble Grundy
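
To make the idea of scoring a query-family match by a product of p-values concrete, the following is a minimal sketch of the simpler, independent case only. It assumes the pairwise p-values are independent and uniformly distributed under the null hypothesis, for which the probability that the product of n p-values is at most x is x times the sum over i from 0 to n-1 of (-ln x)^i / i!. It does not reproduce the correlated-p-value distribution or the ESIZE algorithm described in the abstract. The function names `product_pvalue` and `family_evalue`, and the conversion of a p-value to an E-value by multiplying by the number of families searched, are illustrative assumptions and are not taken from the paper.

```python
import math


def product_pvalue(pvalues):
    """P-value of the product of n p-values, ASSUMING independence.

    For n independent Uniform(0, 1) p-values with observed product x,
    P(product <= x) = x * sum_{i=0}^{n-1} (-ln x)^i / i!.
    The correlated case treated in the paper is NOT modeled here.
    """
    n = len(pvalues)
    if n == 0:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in the interval (0, 1]")

    # Work with the log of the product so many small p-values stay tractable.
    log_x = sum(math.log(p) for p in pvalues)
    t = -log_x  # t = -ln(x), which is non-negative

    # Accumulate sum_{i=0}^{n-1} t^i / i! term by term.
    term = 1.0
    total = 1.0
    for i in range(1, n):
        term *= t / i
        total += term

    return min(1.0, math.exp(log_x) * total)


def family_evalue(pvalues, num_families):
    """Hypothetical helper: scale the combined p-value by the library size.

    Assumes E-value = (number of families searched) * p-value, a common
    convention that the abstract itself does not spell out.
    """
    return num_families * product_pvalue(pvalues)


if __name__ == "__main__":
    # A query matching two family members with pairwise p-values of 0.1 each,
    # searched against a hypothetical library of 100 families.
    print(product_pvalue([0.1, 0.1]))
    print(family_evalue([0.1, 0.1], 100))
```

With a single p-value the combined value equals that p-value, which is a quick sanity check on the formula. When the pairwise scores within a family are correlated, as the abstract notes they are, this independent-case formula would no longer be appropriate as written.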