Canonicalization of database records using adaptive similarity measures

It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from online papers. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as canonicalization. Despite its importance, there is little existing work on canonicalization. In this paper, we explore the use of edit distance measures to construct a canonical representation that is "central" in the sense that it is most similar to each of the disparate records. This approach reduces the impact of noisy records on the canonical representation. Furthermore,...
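
As a rough illustration of the "central" record idea in the abstract, the sketch below picks, from a set of string mentions, the one with the smallest total edit distance to all the others. It makes several assumptions that are not from the paper: it uses plain unit-cost Levenshtein distance rather than the adaptive (learned) similarity measures named in the title, it treats each record as a single string, and the function names `levenshtein` and `central_record` as well as the sample mentions are made up for this example.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def central_record(records: Sequence[str]) -> str:
    """Return the record with the smallest summed edit distance to all records.

    Ties are broken by input order. This is an assumption of the sketch,
    not a rule taken from the paper.
    """
    if not records:
        raise ValueError("records must be non-empty")
    return min(records, key=lambda r: sum(levenshtein(r, o) for o in records))


# Hypothetical mentions of the same venue, invented for illustration.
mentions = ["Proc. of KDD", "KDD", "Proc. KDD", "Proc. of KDD 2007"]
print(central_record(mentions))
```

Because the chosen string minimizes total distance to the whole group, a single badly garbled mention shifts the result far less than it would if one record were picked arbitrarily, which is the noise-robustness property the abstract points to.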

Aron Culotta, Michael L. Wick, Robert Hall, Matthew Marzilli, Andrew McCallum

Canonical Representation | Data Mining | Edit Distance Measures | KDD 2007 | Kdd Versus Conference |

» Adaptive Product Normalization Using Online Learning for Record Linkage in Comparison Shop...

» Surrogate Ranking for Very Expensive Similarity Queries

» Time-dependent semantic similarity measure of queries using historical clickthrough data

» Extracting data records from the web using tag path clustering

» Iterative record linkage for cleaning and integration

» Combining Approximation Techniques and Vector Quantization for Adaptable Similarity Search

Post Info

Added	30 Nov 2009
Updated	30 Nov 2009
Type	Conference
Year	2007
Where	KDD
Authors	Aron Culotta, Michael L. Wick, Robert Hall, Matthew Marzilli, Andrew McCallum
