Prefix Tree Indexing for Similarity Search and Similarity Joins on Genomic Data


Similarity search and similarity join on strings are important for applications such as duplicate detection, error detection, data cleansing, or comparison of biological sequences. DNA sequencing in particular produces large collections of erroneous strings which need to be searched, compared, and merged. However, current RDBMS offer similarity operations only in a very limited and inefficient form that does not scale to the amount of data produced in Life Science projects. We present PETER, a prefix-tree-based indexing algorithm supporting approximate search and approximate joins. Our tool supports Hamming and edit distance as similarity measures and is available as a C++ library, as a Unix command-line tool, and as a cartridge for a commercial database. It combines an efficient implementation of compressed prefix trees with advanced pre-filtering techniques that exclude many candidate strings early. The achieved speed-ups are dramatic, especially for DNA with its small alphabet. We evaluate our...
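
To illustrate the general idea of approximate search over a prefix tree, the following is a minimal sketch in Python. It is not PETER's implementation: it uses a plain (uncompressed) trie, supports only Hamming distance, and omits the pre-filtering techniques mentioned in the abstract. All names (`TrieNode`, `build_trie`, `hamming_search`) are hypothetical and chosen for this example only. The sketch shows how a subtree can be skipped as soon as the number of mismatches on the path exceeds the threshold `k`, which is what makes a shared-prefix index attractive for small alphabets such as DNA.

```python
from typing import Dict, List, Tuple


class TrieNode:
    """One node of a plain, uncompressed prefix tree (illustrative only)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.terminal = False  # True if a stored string ends at this node


def build_trie(strings: List[str]) -> TrieNode:
    """Insert every string into a fresh trie and return its root."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
    return root


def hamming_search(root: TrieNode, query: str, k: int) -> List[Tuple[str, int]]:
    """Return (string, distance) pairs for all stored strings that have the
    same length as `query` and Hamming distance at most `k` from it."""
    results: List[Tuple[str, int]] = []

    def descend(node: TrieNode, depth: int, mismatches: int, prefix: List[str]) -> None:
        if depth == len(query):
            if node.terminal:
                results.append(("".join(prefix), mismatches))
            return
        for ch, child in node.children.items():
            cost = mismatches + (ch != query[depth])
            if cost > k:
                continue  # prune: no string below this child can be within k
            prefix.append(ch)
            descend(child, depth + 1, cost, prefix)
            prefix.pop()

    descend(root, 0, 0, [])
    return sorted(results)


reads = ["ACGT", "ACGA", "TCGA", "GGGG", "ACG"]
index = build_trie(reads)
print(hamming_search(index, "ACGT", 1))  # [('ACGA', 1), ('ACGT', 0)]
```

A similarity join could, under the same assumptions, be sketched by running such a search once per string of the second collection; an edit-distance variant would need to track a dynamic-programming row per trie node instead of a single mismatch counter.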

Astrid Rheinländer, Martin Knobloch, Nicky Ho


Database | RDBMS Offer Similarity | Similarity Joins | Similarity Search | SSDBM 2010 |


Post Info

Added	02 Aug 2010
Updated	02 Aug 2010
Type	Conference
Year	2010
Where	SSDBM
Authors	Astrid Rheinländer, Martin Knobloch, Nicky Hochmuth, Ulf Leser
