Sciweavers

VLDB
2002
ACM

Database indexing for large DNA and protein sequence collections

Our aim is to develop new database technologies for the approximate matching of unstructured string data using indexes. We explore the potential of the suffix tree data structure in this context. We present a new method of building suffix trees that allows us to build trees larger than available RAM, which has not previously been possible. We show that this method performs in practice as well as the O(n) method of Ukkonen [70]. Using this method, we build indexes for 200Mb of protein and 300Mbp of DNA, whose disk images exceed the available RAM. We show experimentally that suffix trees can be used effectively in approximate string matching with biological data. For a range of query lengths and error bounds, the suffix tree reduces the size of the unoptimised O(mn) dynamic programming calculation required in the evaluation of string similarity, and the gain from indexing increases with index size. In the indexes we built, this reduction is significant, and less than 0.3% of the expected matrix is ...
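For readers unfamiliar with the O(mn) baseline mentioned in the abstract, the following is a minimal Python sketch of unindexed approximate matching by dynamic programming (the classic column-by-column formulation often attributed to Sellers). It is not the authors' suffix tree method; the function name approx_matches, the unit edit costs, and the sample strings are assumptions made for illustration only. It fills one matrix column per text character, which is the full m-by-n computation that an index aims to avoid.

```python
def approx_matches(pattern, text, k):
    """Return (end, distance) pairs: pattern matches some substring of
    text ending at position `end` (exclusive) with edit distance <= k.

    Hypothetical helper for illustration; unit costs are assumed for
    insertion, deletion, and substitution.
    """
    m = len(pattern)
    # col[i] holds the minimum edit distance between pattern[:i] and
    # any substring of text ending at the current text position.
    col = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]   # D[i-1][j-1] for i = 1
        col[0] = 0           # a match may start anywhere in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra character in text
                      col[i - 1] + 1)    # missing character in text
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            hits.append((j, col[m]))
    return hits


exact = approx_matches("ACGT", "TTACGTTT", 0)
one_error = approx_matches("ACGT", "ACCT", 1)
```

Every text character costs m cell updates here, so the work grows as m times n. The abstract's claim is that a suffix tree index lets the search evaluate only a small fraction of these cells.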
Ela Hunt, Malcolm P. Atkinson, Robert W. Irving
Added 05 Dec 2009
Updated 05 Dec 2009
Type Conference
Year 2002
Where VLDB
Authors Ela Hunt, Malcolm P. Atkinson, Robert W. Irving