Using structural contexts to compress semistructured text collections

15 years 2 months ago


We describe a compression model for semistructured documents, called the Structural Contexts Model (SCM), which takes advantage of the context information usually implicit in the structure of the text. The idea is to use a separate model to compress the text that lies inside each different structure type (e.g., each different XML tag). The intuition behind SCM is that the distribution of all the texts that belong to a given structure type should be similar, and different from that of other structure types. We mainly focus on semistatic models, and test our idea using a word-based Huffman method. This is the standard for compressing large natural language text databases, because random access, partial decompression, and direct search of the compressed collection are possible. This variant, dubbed SCMHuff, retains those features and improves Huffman’s compression ratios. We consider the possibility that storing separate models may not pay off if the distribution of different structure t...
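The core idea in the abstract, one word-based Huffman model per structure type versus a single model for all text, can be illustrated with a small sketch. This is not the authors' SCMHuff implementation: the function names (`huffman_code_lengths`, `words_by_tag`, `payload_bits`, `compare`), the toy tag tokenizer, and the sample document are all assumptions made for illustration. The sketch counts only the bits of the encoded words and ignores the cost of storing each model's vocabulary.

```python
import heapq
import re
from collections import Counter, defaultdict
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        # A lone symbol is still charged one bit per occurrence.
        return {next(iter(freqs)): 1}
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), [sym]) for sym, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        # Every symbol under the merged node gets one bit deeper.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), syms1 + syms2))
    return lengths


def words_by_tag(doc):
    """Group words by innermost enclosing tag (toy tokenizer, no attributes)."""
    groups = defaultdict(list)
    stack = ["#root"]
    for close, tag, word in re.findall(r"<(/?)(\w+)>|([^\s<>]+)", doc):
        if tag and close:
            if len(stack) > 1:
                stack.pop()
        elif tag:
            stack.append(tag)
        else:
            groups[stack[-1]].append(word)
    return groups


def payload_bits(words):
    """Bits needed to encode the word sequence with its own Huffman code."""
    freqs = Counter(words)
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[w] * lengths[w] for w in freqs)


def compare(doc):
    """Return (bits with one global model, bits with one model per tag)."""
    groups = words_by_tag(doc)
    single = payload_bits([w for ws in groups.values() for w in ws])
    separate = sum(payload_bits(ws) for ws in groups.values())
    return single, separate


# Hypothetical sample document, used only to exercise the sketch.
doc = (
    "<doc><title>fast text search</title>"
    "<body>the text is the text</body>"
    "<title>fast index</title>"
    "<body>the index is the index</body></doc>"
)
single, separate = compare(doc)
print(f"single model: {single} bits, per-tag models: {separate} bits")
```

In this sketch the per-tag payload can never exceed the single-model payload, because the global code restricted to one tag's vocabulary is itself a valid prefix code for that tag. The saving is not free, though: each extra model has to be stored, which is the trade-off the abstract raises when the distributions of different structure types are too similar.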

Joaquín Adiego, Gonzalo Navarro, Pablo de l
