Accuracy of Approximate String Joins Using Grams

15 years 8 months ago


Approximate join is an important part of many data cleaning and integration methodologies. Various similarity measures have been proposed for accurate and efficient matching of string attributes. The accuracy of these similarity measures depends heavily on characteristics of the data, such as the amount and type of errors and the length of the strings. Recently, there has been increasing interest in methods based on q-grams (substrings of length q) extracted from the strings, mainly due to their high efficiency. In this work, we evaluate the accuracy of the similarity measures used in these methodologies. We present an overview of several similarity measures based on q-grams. We then thoroughly compare their accuracy on several datasets with different characteristics. Since the efficiency of approximate joins depends on the similarity threshold they use, we study how the value of the threshold (including values used in recent performance studies) affects the accuracy of the join. W...
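To make the notions of q-grams and a similarity threshold concrete, here is a minimal sketch of a q-gram based approximate join. It is not taken from the paper. Jaccard overlap on q-gram sets is used purely as an illustration (the abstract does not say which measures are evaluated), and the names `qgrams`, `jaccard`, and `approximate_join`, as well as the `#` and `$` padding characters, are assumptions made for this example.

```python
def qgrams(s, q=2):
    """Return the set of q-grams of s (illustrative helper).

    The string is lowercased and padded with q-1 '#' characters in front
    and q-1 '$' characters at the end, so that prefixes and suffixes also
    contribute grams. The padding scheme is an assumption of this sketch.
    """
    padded = "#" * (q - 1) + s.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b, q=2):
    """Jaccard similarity of the q-gram sets of a and b."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0  # two gram-less strings are treated as identical
    return len(ga & gb) / len(ga | gb)


def approximate_join(left, right, threshold, q=2):
    """Naive nested-loop join: keep pairs whose similarity >= threshold."""
    result = []
    for l in left:
        for r in right:
            sim = jaccard(l, r, q)
            if sim >= threshold:
                result.append((l, r, sim))
    return result


if __name__ == "__main__":
    names_a = ["smith"]
    names_b = ["smyth", "jones"]
    # The same data joined at two thresholds: a stricter threshold
    # drops the near-match.
    print(approximate_join(names_a, names_b, threshold=0.5))
    print(approximate_join(names_a, names_b, threshold=0.6))
```

In this toy input, "smith" and "smyth" share 4 of 8 distinct padded 2-grams, so the pair is returned at a threshold of 0.5 and dropped at 0.6. This is the kind of threshold sensitivity the abstract refers to. A practical join would avoid the quadratic nested loop, for example by indexing the grams.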

Oktie Hassanzadeh, Mohammad Sadoghi, Renée
