Quality-Based Similarity Search for Biological Sequence Databases

Low-complexity regions (LCRs) of biological sequences are the main source of false positives in similarity searches over biological sequence databases. We consider the problem of finding similar sequences when the locations of the LCRs are not known precisely. We develop a formulation that measures the quality of each letter in a sequence: the quality value of a letter is the probability that the letter lies in a non-LCR. We show that quality values can be employed in two fundamental approaches to the sequence search problem to significantly reduce the number of false positives they produce. The first approach finds the optimal alignment of two sequences using dynamic programming. The second computes a suboptimal alignment using a hash table. For the second approach, we also develop a randomized memory-resident hash table that indexes k-grams (sequences of length k) probabilistically. The k-grams that are likely to contain LCRs are indexed with lower probabilities. As a result, memory usage a...
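
The probabilistic k-gram indexing described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `build_index`, the `quality` list of per-letter quality values, and the choice to index a k-gram with probability equal to the product of its letters' quality values (which treats the letters as independent) are all assumptions made for illustration; the abstract does not state how the indexing probability is derived.

```python
import random
from collections import defaultdict
from math import prod


def build_index(seq, quality, k, rng):
    """Index the k-grams of seq in a hash table, each with some probability.

    quality[i] is taken to be the probability that seq[i] lies in a non-LCR.
    ASSUMPTION: a k-gram is indexed with probability equal to the product of
    its letters' quality values. This rule is illustrative only.
    """
    if len(seq) != len(quality):
        raise ValueError("seq and quality must have the same length")
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        p_index = prod(quality[i:i + k])
        # rng.random() is uniform on [0, 1), so the k-gram is kept with
        # probability p_index; low-quality k-grams are mostly skipped.
        if rng.random() < p_index:
            index[seq[i:i + k]].append(i)
    return index


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run is repeatable
    seq = "ACGTAAAAAAAAACGT"
    # Hypothetical quality values: the run of A's is treated as a likely LCR.
    quality = [0.9] * 4 + [0.1] * 9 + [0.9] * 3
    for kgram, positions in sorted(build_index(seq, quality, 3, rng).items()):
        print(kgram, positions)
```

Under this rule, k-grams that overlap a likely LCR are rarely stored, so the table holds fewer entries than a full index of the same sequence.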

Xuehui Li, Tamer Kahveci

BIOCOMP 2007 | Bioinformatics | False Positives | Hash Table | Sequences |

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2007
Where	BIOCOMP
Authors	Xuehui Li, Tamer Kahveci
