Selectivity Estimation for Fuzzy String Predicates in Large Data Sets

Many database applications have the emerging need to support fuzzy queries that ask for strings that are similar to a given string, such as “name similar to smith” and “telephone number similar to 412-0964.” Query optimization needs the selectivity of such a fuzzy predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimating selectivities of fuzzy string predicates. We develop a novel technique, called Sepia, to solve the problem. It groups strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram for the database. It is based on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the cluster, based on the proximity between q and p, and the proximity between p and s, we can obtain a probability distribution from a global histogram about the similarity between q and s. We give a full specification of the tech...
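To make the quantity being estimated concrete, here is a minimal Python sketch of the exact selectivity that an estimator would try to approximate. It is not Sepia and builds no clusters or histograms; it is only the brute-force baseline that scans every record. Two things are assumptions not stated in the abstract: the similarity function is taken to be Levenshtein edit distance, and "similar" is taken to mean "within a distance threshold" (the parameter `max_distance` is hypothetical). The helper `distance_bounds` illustrates why knowing the proximity of q to p and of p to s constrains the proximity of q to s: for any metric, including edit distance, the triangle inequality bounds it from both sides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def exact_selectivity(records, query, max_distance):
    """Fraction of records within max_distance of query (full scan).

    Assumption: the fuzzy predicate is 'edit distance <= max_distance'.
    """
    if not records:
        return 0.0
    matches = sum(1 for s in records
                  if edit_distance(query, s) <= max_distance)
    return matches / len(records)


def distance_bounds(d_qp, d_ps):
    """Triangle-inequality bounds on d(q, s) given d(q, p) and d(p, s).

    Holds for any metric: |d(q,p) - d(p,s)| <= d(q,s) <= d(q,p) + d(p,s).
    """
    return abs(d_qp - d_ps), d_qp + d_ps


if __name__ == "__main__":
    names = ["smith", "smyth", "smithe", "jones"]  # illustrative data only
    print(exact_selectivity(names, "smith", 1))
    print(distance_bounds(2, 5))
```

A full scan like this costs one distance computation per record, which is the expense a query optimizer cannot afford at planning time and the reason a compact summary structure is attractive. The triangle-inequality bounds give only a range for the distance between q and s; the abstract describes going further and drawing a probability distribution over that similarity from a global histogram.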

Liang Jin, Chen Li

Database | Fuzzy String Predicates | Global Histogram | Histogram Structures | VLDB 2005

Post Info

Added	28 Jun 2010
Updated	28 Jun 2010
Type	Conference
Year	2005
Where	VLDB
Authors	Liang Jin, Chen Li
