A compressed full-text self-index for a text T , of size u, is a data structure used to search for patterns P, of size m, in T , that requires reduced space, i.e. space that depends on the empirical entropy (Hk or H0) of T , and is, furthermore, able to reproduce any substring of T . In this paper we present a new compressed self-index able to locate the occurrences of P in O((m + occ) log u) time, where occ is the number of occurrences. The fundamental improvement over previous LZ78 based indexes is the reduction of the search time dependency on m from O(m2 ) to O(m). To achieve this result we point out the main obstacle to linear time algorithms based on LZ78 data compression and expose and explore the nature of a recurrent structure in LZ-indexes, the T78 suffix tree. We show that our method is very competitive in practice by comparing it against other state of the art compressed indexes.
Luís M. S. Russo, Arlindo L. Oliveira