Growing interest in genomic research has resulted in the creation of huge biological sequence databases. In this paper, we present a hash-based pier model for efficient homology search in large DNA sequence databases. In our model, only certain segments in the databases called `piers' need to be accessed during searches as opposite to other approaches which require a full scan on the biological sequence database. To further improve the search efficiency, the piers are stored in a specially designed hash table which helps to avoid expensive alignment operation. The hash table is small enough to reside in main memory, hence avoiding I/O in the search steps. We show theoretically and empirically that the proposed approach can efficiently detect biological sequences that are similar to a query sequence with very high sensitivity.
Xia Cao, Shuai Cheng Li, Beng Chin Ooi, Anthony K.