Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences



Background: The number of k-words shared between two sequences is a simple and efficient alignment-free sequence comparison method. This statistic, D2, has been used for the clustering of EST sequences. Sequence comparison based on D2 is extremely fast: its run time is proportional to the size of the sequences under scrutiny, whereas alignment-based comparisons have a worst-case run time proportional to the square of the size. Recent studies have rigorously examined the statistical distribution of D2, and asymptotic regimes have been derived. The distribution of approximate k-word matches has also been studied. Results: We have computed the D2 optimal word size for various sequence lengths, and for both perfect and approximate word matches. Kolmogorov-Smirnov tests show D2 to have a compound Poisson distribution at the optimal word size for small sequence lengths (below 400 letters) and a normal distribution at the optimal word size for large sequence lengths (above 1600 lette...
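
To make the statistic concrete, here is a minimal Python sketch of D2 for exact word matches. It assumes the usual definition, in which D2 counts every pair of positions (one in each sequence) whose k-words are identical. The function names `kmer_counts` and `d2` are illustrative and do not come from the paper. Each sequence is scanned once, so the work grows roughly linearly with sequence length for a fixed k, in line with the run time claim in the abstract.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-word in seq (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of exact k-word matches between seq_a and seq_b.

    Assumed definition: the sum, over all k-words w, of
    (occurrences of w in seq_a) * (occurrences of w in seq_b).
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A Counter returns 0 for words absent from seq_b.
    return sum(n * counts_b[word] for word, n in counts_a.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "ACGT", 2))
```

Approximate word matches, which the abstract also mentions, are not covered by this sketch.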

Sylvain Forêt, Miriam R. Kantorovitz, Conrad


BMCBI 2006 | Optimal Word Size | Sequence Comparison | Sequences |


Post Info

Added	10 Dec 2010
Updated	10 Dec 2010
Type	Journal
Year	2006
Where	BMCBI
Authors	Sylvain Forêt, Miriam R. Kantorovitz, Conrad J. Burden
