Sciweavers

KDD
2007
ACM

Canonicalization of database records using adaptive similarity measures

It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from online papers. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as canonicalization. Despite its importance, there is little existing work on canonicalization. In this paper, we explore the use of edit distance measures to construct a canonical representation that is "central" in the sense that it is most similar to each of the disparate records. This approach reduces the impact of noisy records on the canonical representation. Furthermore,...
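The idea of a "central" record can be illustrated with a minimal sketch. It assumes a plain unit-cost Levenshtein distance rather than the adaptive similarity measures named in the title, and it simply picks the existing record whose total distance to all the others is smallest. The function names `levenshtein` and `canonical_record` are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def canonical_record(records: Sequence[str]) -> str:
    """Return the record with the smallest summed edit distance to all records.

    This is an assumption-laden stand-in for the "central" representation:
    it can only return one of the input strings, never a newly built one.
    """
    if not records:
        raise ValueError("records must be non-empty")
    return min(records, key=lambda r: sum(levenshtein(r, other) for other in records))


# Illustrative input: several noisy mentions of the same author name.
mentions = ["Andrew McCallum", "A. McCallum", "Andrew McCalum", "Andrew McCallum"]
print(canonical_record(mentions))
```

Because the choice depends on summed distances, a single badly misspelled mention has limited influence on which record is selected, which matches the abstract's point about reducing the impact of noisy records.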
Added 30 Nov 2009
Updated 30 Nov 2009
Type Conference
Year 2007
Where KDD
Authors Aron Culotta, Michael L. Wick, Robert Hall, Matthew Marzilli, Andrew McCallum