DCC
2010
IEEE

A New Searchable Variable-to-Variable Compressor

Word-based compression of natural language text has proven to be a good way to balance compression ratio and speed, achieving compression ratios close to 30% and very fast decompression. It also permits fast searches over the compressed text using Boyer-Moore type algorithms. Such compressors process fixed source symbols (words) and assign them variable byte-length codewords, thus following a fixed-to-variable approach. We present a new variable-to-variable compressor (v2vdc) that uses words and phrases as the source symbols, which are encoded with a variable-length scheme. The phrases are chosen using the longest common prefix information on the suffix array of the text, so as to favor long and frequent phrases. We obtain compression ratios close to those of p7zip and ppmdi, better than those of bzip2, and 8-10 percentage points lower than those of the equivalent word-based compressor. V2vdc is also among the fastest to decompress, and allows efficient direct searc...
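To illustrate how longest common prefix (LCP) information on a suffix array exposes repeated phrases, here is a minimal Python sketch. It is not the authors' phrase-selection procedure: the function names, the naive construction, and the `min_len` threshold are assumptions made for this example, and the abstract does not specify how v2vdc ranks or filters candidates.

```python
def word_suffix_array(words):
    # Naive construction: sort suffix start positions by the word
    # sequence that begins at each position (fine for small inputs).
    return sorted(range(len(words)), key=lambda i: words[i:])


def lcp_array(words, sa):
    # lcp[k] = number of leading words shared by the suffixes starting
    # at sa[k - 1] and sa[k]; lcp[0] is 0 by convention.
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while (a + n < len(words) and b + n < len(words)
               and words[a + n] == words[b + n]):
            n += 1
        lcp[k] = n
    return lcp


def repeated_phrases(words, min_len=2):
    # Hypothetical helper: every LCP value >= min_len marks a phrase
    # that occurs at least twice in the text.
    sa = word_suffix_array(words)
    lcp = lcp_array(words, sa)
    phrases = set()
    for k in range(1, len(sa)):
        if lcp[k] >= min_len:
            phrases.add(tuple(words[sa[k]:sa[k] + lcp[k]]))
    return phrases


if __name__ == "__main__":
    text = "to be or not to be".split()
    print(repeated_phrases(text))
```

Adjacent suffixes in the sorted order share the longest prefixes, so scanning the LCP array once is enough to surface candidate phrases; a real compressor would then weigh phrase length against frequency before assigning codewords.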
Added 17 May 2010
Updated 17 May 2010
Type Conference
Year 2010
Where DCC
Authors Nieves R. Brisaboa, Antonio Fariña, Juan-Ramón López, Gonzalo Navarro, Eduardo R. Lopez