Compressing term positions in web indexes

16 years 1 months ago

Download cis.poly.edu

Large search engines process thousands of queries per second on billions of pages, making query processing a major factor in their operating costs. This has led to a lot of research on how to improve query throughput, using techniques such as massive parallelism, caching, early termination, and inverted index compression. We focus on techniques for compressing term positions in web search engine indexes. Most previous work has focused on compressing docID and frequency data, or position information in other types of text collections. Compression of term positions in web pages is complicated by the fact that term occurrences tend to cluster within documents but not across document boundaries, making it harder to exploit clustering eﬀects. Also, typical access patterns for position data are diﬀerent from those for docID and frequency data. We perform a detailed study of a number of existing and new techniques for compressing position data in web indexes. We also study how to eﬃcie...

Hao Yan, Shuai Ding, Torsten Suel

Real-time Traffic