Direct Pattern Matching on Compressed Text



We present a fast compression and decompression technique for natural language texts. The novelty is that the exact search can be done on the compressed text directly, using any known sequential pattern matching algorithm. Approximate search can also be done efficiently without any decoding. The compression scheme uses a semi-static word-based modeling and a Huffman coding where the coding alphabet is byte-oriented rather than bit-oriented. We use the first bit of each byte to mark the beginning of a word, which allows the searching of the compressed pattern directly on the compressed text. We achieve about 33% compression ratio for typical English texts. When searching for simple patterns, our experiments show that running our algorithm on a compressed text is almost twice as fast as running agrep on the uncompressed version of the same text. When searching complex or approximate patterns, our algorithm is up to 8 times faster than agrep.
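The scheme in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `build_codebook`, `compress`, `find_phrase` and the `radix` parameter are hypothetical, and splitting the text on whitespace is a simplifying assumption. Each byte carries one 7-bit Huffman digit, and the high bit is set only on the first byte of a codeword, so a compressed pattern can only match at a word boundary and an ordinary byte-string search suffices.

```python
import heapq
from collections import Counter
from itertools import count

TAG = 0x80  # high bit: set only on the first byte of each codeword


def build_codebook(words, radix=128):
    """Assign each distinct word a tagged, byte-oriented Huffman codeword.

    Each byte holds one base-`radix` digit (radix <= 128, i.e. 7 bits);
    the high bit is reserved to flag the first byte of a codeword.
    """
    freq = Counter(words)
    if len(freq) == 1:                  # degenerate one-word vocabulary
        return {next(iter(freq)): bytes([TAG])}
    tie = count()                       # tie-breaker so tree nodes are never compared
    heap = [(f, next(tie), w) for w, f in freq.items()]
    # Pad with zero-weight dummy leaves so every merge joins exactly `radix` nodes.
    while (len(heap) - 1) % (radix - 1):
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(radix)]
        weight = sum(g[0] for g in group)
        heapq.heappush(heap, (weight, next(tie), [g[2] for g in group]))
    # Walk the tree; the path of child indices is the digit sequence.
    codebook = {}
    stack = [(heap[0][2], [])]
    while stack:
        node, digits = stack.pop()
        if isinstance(node, list):      # internal node: one digit per child
            stack.extend((child, digits + [d]) for d, child in enumerate(node))
        elif node is not None:          # real word (dummy leaves are skipped)
            codebook[node] = bytes([digits[0] | TAG] + digits[1:])
    return codebook


def compress(words, codebook):
    """Concatenate the codewords of a word sequence."""
    return b"".join(codebook[w] for w in words)


def find_phrase(compressed, phrase_words, codebook):
    """Return byte offsets of the phrase, searching the compressed bytes directly."""
    if not phrase_words or any(w not in codebook for w in phrase_words):
        return []                       # empty phrase or out-of-vocabulary word
    pattern = compress(phrase_words, codebook)
    hits, pos = [], compressed.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = compressed.find(pattern, pos + 1)
    return hits


text = "to be or not to be that is the question".split()
codebook = build_codebook(text)
data = compress(text, codebook)
print(find_phrase(data, ["to", "be"], codebook))
```

In this toy input the eight distinct words all fit in single-byte codewords, so the reported byte offsets coincide with word positions. `bytes.find` stands in for "any known sequential pattern matching algorithm"; because the underlying code is prefix-free and only first bytes are tagged, no extra boundary check is needed after a match.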

Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, Ricardo A. Baeza-Yates


Compressed Text | Information Management | Pattern Matching Algorithm | Semi-static Word-based Modeling | SPIRE 1998


» Multiple Pattern Matching in LZW Compressed Text

» Approximate Searching on Compressed Text

» Pattern Matching in Text Compressed by Using Antidictionaries

» A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text

» Fast Searching on Compressed Text Allowing Errors

» A Dictionary-Based Compressed Pattern Matching Algorithm

» An Efficient Pattern Matching Algorithm on a Subclass of Context Free Grammars

» Multiple Pattern Matching Algorithms on Collage System

Post Info

Added	06 Aug 2010
Updated	06 Aug 2010
Type	Conference
Year	1998
Where	SPIRE
Authors	Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, Ricardo A. Baeza-Yates
