Sciweavers

IR
2007

Lightweight natural language text compression

13 years 11 months ago
Lightweight natural language text compression
Variants of Huffman codes where words are taken as the source symbols are currently the most attractive choices to compress natural language text databases. In particular, Tagged Huffman Code by Moura et al. offers fast direct searching on the compressed text and random access capabilities, in exchange for producing around 11% larger compressed files. This work describes End-Tagged Dense Code and (s, c)-Dense Code, two new semistatic statistical methods for compressing natural language texts. These techniques permit simpler and faster encoding and obtain better compression ratios than Tagged Huffman Code, while maintaining its fast direct search and random access capabilities. We show that Dense Codes improve Tagged Huffman Code compression ratio by about 10%, reaching only 0.6% overhead over the optimal Huffman compression ratio. Being simpler, Dense Codes are generated 45% to 60% faster than Huffman codes. This makes Dense Codes a very attractive alternative to Huffman code...
Nieves R. Brisaboa, Antonio Fariña, Gonzalo
Added 15 Dec 2010
Updated 15 Dec 2010
Type Journal
Year 2007
Where IR
Authors Nieves R. Brisaboa, Antonio Fariña, Gonzalo Navarro, José R. Paramá
Comments (0)