
PVLDB
2008

Hashed samples: selectivity estimators for set similarity selection queries

We study selectivity estimation techniques for set similarity queries. A wide variety of similarity measures for sets have been proposed in the past. In this work we concentrate on the class of weighted similarity measures (e.g., TF/IDF and BM25 cosine similarity and variants) and design selectivity estimators based on samples constructed a priori. First, we study the pitfalls associated with straightforward applications of random sampling, and argue that care needs to be taken in how the samples are constructed: uniform random sampling yields very low accuracy, while query-sensitive real-time sampling is more expensive than exact solutions (in both CPU and I/O cost). We show how to build robust samples a priori, based on existing synopses for distinct value estimation. We prove the accuracy of our technique theoretically and verify its performance experimentally. Our algorithm is orders of magnitude faster than exact solutions and has very small space overhead.
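The abstract does not spell out the estimator, so the following is only a minimal sketch of the general idea of a hash-based sample built ahead of query time. It is not the authors' algorithm. It assumes that a set is kept in the sample when a hash of its identifier falls below a sampling rate p, so that every inverted list is sampled consistently, and that the count of qualifying sets found in the sample is scaled by 1/p. The names unit_hash, build_sample, cosine and estimate_selectivity, the token weights, and the weighted cosine over binary term frequencies are all hypothetical choices made for illustration.

```python
import hashlib
import math
from collections import defaultdict


def unit_hash(set_id):
    """Deterministically map a set id to a value in [0, 1)."""
    digest = hashlib.sha1(str(set_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def build_sample(sets, p):
    """Build sampled inverted lists (token -> set ids) ahead of query time.

    Assumption: a set is kept iff unit_hash(id) < p, so the same sets are
    kept in every token's list (a coordinated sample).
    """
    inverted = defaultdict(list)
    for set_id, tokens in sets.items():
        if unit_hash(set_id) < p:
            for token in tokens:
                inverted[token].append(set_id)
    return inverted


def cosine(a, b, weights):
    """Weighted cosine similarity of two token sets (binary term frequency)."""
    dot = sum(weights.get(t, 0.0) ** 2 for t in a & b)
    norm_a = math.sqrt(sum(weights.get(t, 0.0) ** 2 for t in a))
    norm_b = math.sqrt(sum(weights.get(t, 0.0) ** 2 for t in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def estimate_selectivity(query, threshold, sets, inverted, weights, p):
    """Estimate how many sets have similarity >= threshold to the query."""
    # Only sampled sets sharing at least one token with the query can qualify.
    candidates = set()
    for token in query:
        candidates.update(inverted.get(token, ()))
    hits = sum(
        1 for sid in candidates
        if cosine(query, sets[sid], weights) >= threshold
    )
    # Scale the sample count up by the inverse sampling rate.
    return hits / p
```

With p = 1.0 every set is kept and the estimate equals the exact count. Smaller values of p trade accuracy for space in this sketch; the accuracy guarantees stated in the abstract belong to the authors' technique and not to this simplified version.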
Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava
Added 28 Dec 2010
Updated 28 Dec 2010
Type Journal
Year 2008
Where PVLDB
Authors Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava