Exact match search in sequence data using suffix trees

16 years 15 days ago

Download users.encs.concordia.ca

We study suitable indexing techniques to support efficient exact match search in large biological sequence databases. We propose a suffix tree (ST) representation, called STA-DF, as an alternative to the array representation of ST (STA) proposed in [7] and utilized in [18]. To study the performance of STA and STA-DF, we develop a memory efficient ST-based Exact Match (STEM) search algorithm. We implemented STEM and both representations of ST and conducted extensive experiments. Our results indicate that the STA and STA-DF representations are very similar in construction time, storage utilization, and search time using STEM. In terms of the access patterns by STEM, our results show that compared to STA, the STA-DF representation exhibits better spatial and sequential locality of reference. This suggests that STA-DF would require less number of disk I/Os, and hence is more amenable to efficient and scalable disk-based computation. Categories and Subject Descriptors E.1 [Data Structures]...

Mihail Halachev, Nematollaah Shiri, Anand Thamildu

Real-time Traffic