Hashed samples: selectivity estimators for set similarity selection queries

We study selectivity estimation techniques for set similarity queries. A wide variety of similarity measures for sets have been proposed in the past. In this work we concentrate on the class of weighted similarity measures (e.g., TF/IDF and BM25 cosine similarity and variants) and design selectivity estimators based on samples constructed a priori. First, we study the pitfalls associated with straightforward applications of random sampling, and argue that care needs to be taken in how the samples are constructed: uniform random sampling yields very low accuracy, while query-sensitive real-time sampling is more expensive than exact solutions (in both CPU and I/O cost). We show how to build robust samples a priori, based on existing synopses for distinct value estimation. We prove the accuracy of our technique theoretically, and verify its performance experimentally. Our algorithm is orders of magnitude faster than exact solutions and has very small space overhead.
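To make the idea of samples built a priori from distinct-value synopses more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the algorithm from the paper: it keeps, for every token's inverted list, the `k` set ids with the smallest hash values (a bottom-k style synopsis), and at query time combines the sampled lists of the query tokens to estimate how many sets reach a TF/IDF-style cosine threshold. The names `hash01`, `build_synopsis`, and `estimate_selectivity`, the IDF formula, and the parameter `k` are all hypothetical choices made for this example.

```python
import hashlib
import heapq
import math
from collections import defaultdict


def hash01(set_id):
    """Deterministically map a set id to a pseudo-random value in [0, 1)."""
    digest = hashlib.sha256(str(set_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def build_synopsis(db, k):
    """Build per-token bottom-k samples a priori (assumed layout, not the paper's).

    db maps set id -> set of tokens. For each token we keep the k ids with the
    smallest hashes plus a cutoff: every id in that list whose hash is below
    the cutoff is guaranteed to be in the sample.
    """
    lists = defaultdict(list)
    for sid, tokens in db.items():
        for t in tokens:
            lists[t].append(sid)
    n = len(db)
    # Assumed IDF weighting; the paper covers TF/IDF and BM25 variants.
    weights = {t: math.log(1.0 + n / len(ids)) for t, ids in lists.items()}
    samples, norms = {}, {}
    for t, ids in lists.items():
        kept = heapq.nsmallest(k, ((hash01(s), s) for s in ids))
        cutoff = kept[-1][0] if len(ids) > k else 1.0
        samples[t] = (kept, cutoff)
        for _, sid in kept:  # store norms only for sampled ids
            if sid not in norms:
                norms[sid] = math.sqrt(sum(weights[x] ** 2 for x in db[sid]))
    return weights, norms, samples


def estimate_selectivity(query, synopsis, threshold):
    """Estimate the number of sets with cosine similarity >= threshold."""
    weights, norms, samples = synopsis
    # Simplification: query tokens unseen in the database are ignored.
    q_tokens = [t for t in query if t in samples]
    if not q_tokens:
        return 0.0
    q_norm = math.sqrt(sum(weights[t] ** 2 for t in q_tokens))
    # Below tau, every sampled list is complete, so similarities are exact.
    tau = min(samples[t][1] for t in q_tokens)
    dots = defaultdict(float)
    for t in q_tokens:
        for hv, sid in samples[t][0]:
            if hv < tau:
                dots[sid] += weights[t] ** 2
    hits = sum(1 for sid, d in dots.items()
               if d / (q_norm * norms[sid]) >= threshold)
    return hits / tau  # scale the sampled count up by the sampling rate
```

Because the same hash function is applied to every inverted list, the sampled lists agree on which ids they keep. Any id whose hash falls below `tau` therefore has exactly known membership in every query token's list, and its similarity to the query can be computed from the samples alone. When all lists involved hold at most `k` ids, `tau` is 1.0 and the estimate equals the exact count.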

Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava

Priori Constructed Samples | PVLDB 2008 | Random Sampling | Selectivity Estimation Techniques |

» Robust variable selection using least angle regression and elemental set sampling

» Selectivity Estimation for Fuzzy String Predicates in Large Data Sets

» Selectivity Estimation for Boolean Queries

» Selectivity Estimation of High Dimensional Window Queries via Clustering

» Robust Selective Sampling from Single and Multiple Teachers

» Rapid Object Indexing Using Locality Sensitive Hashing and Joint 3DSignature Space Estimat...

» When one Sample is not Enough: Improving Text Database Selection Using Shrinkage

» Selecting Distinctive 3D Shape Descriptors for Similarity Retrieval

Post Info

Added	28 Dec 2010
Updated	28 Dec 2010
Type	Journal
Year	2008
Where	PVLDB
Authors	Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava
