Sampling dirty data for matching attributes

We investigate the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a precondition for typical joins over such attributes. However, real-world data sets are often ‘dirty’, especially when integrating data from different sources. To deal with this issue, we propose new similarity measures between sets of strings, which consider not only set-based similarity but also similarity between string instances. To make the measures effective, we develop efficient algorithms for distributed sample creation and similarity computation. Test results show that for dirty data our measures are more accurate for measuring value overlap than existing sample-based methods, but we also observe that there is a clear tradeoff between accuracy and speed. This motivates a two-stage filtering approach, with both measures operating on the same samples. Catego...
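To make the contrast in the abstract concrete, here is a minimal sketch of the difference between an exact set-based overlap and an overlap that also tolerates small differences between string instances. It is not the measure or algorithm proposed in the paper: the function names, the 0.8 threshold, the use of Python's `difflib.SequenceMatcher` as the string similarity, and the sample values are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def exact_overlap(a, b):
    """Fraction of distinct values in a that appear verbatim in b."""
    a, b = set(a), set(b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


def fuzzy_overlap(a, b, threshold=0.8):
    """Fraction of distinct values in a with a similar string in b.

    The threshold and the similarity function are illustrative
    assumptions, not taken from the paper.
    """
    a, b = set(a), set(b)
    if not a:
        return 0.0
    matched = sum(
        1
        for x in a
        if any(SequenceMatcher(None, x, y).ratio() >= threshold for y in b)
    )
    return matched / len(a)


# Hypothetical samples of two attributes, one containing typos.
clean = ["Brisbane", "Sydney", "Melbourne", "Perth"]
dirty = ["Brisbane", "Sydny", "Melborne", "Darwin"]

print(exact_overlap(clean, dirty))  # 0.25
print(fuzzy_overlap(clean, dirty))  # 0.75
```

On these samples the exact measure sees only one shared value, while the tolerant measure also pairs the misspelled values. The tolerant version compares every value in one sample against every value in the other, which hints at the accuracy versus speed tradeoff the abstract mentions.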

Henning Köhler, Xiaofang Zhou, Shazia Wasim Sadiq, Yanfeng Shu, Kerry L. Taylor

Database | Real-world Data Sets | SIGMOD 2010 | Similarity Computation | Value Sets |

» Classifying transformation-variant attributed point patterns

» Automatic Data Fusion with HumMer

» A Scheme for Approximate Matching Event Announcements to a Customer Database

» Making holistic schema matching robust: an ensemble approach

» Detecting Changes in XML Documents

» Does Knowledge Management Pay Off

» Indexing by Shape of Image Databases Based on Extended Grid Files

Post Info

Added	18 Jul 2010
Updated	18 Jul 2010
Type	Conference
Year	2010
Where	SIGMOD
Authors	Henning Köhler, Xiaofang Zhou, Shazia Wasim Sadiq, Yanfeng Shu, Kerry L. Taylor

Sciweavers
