DMIN
2009

Efficient Record Linkage using a Double Embedding Scheme

Record linkage is the problem of identifying similar records across different data sources. The similarity between two records is defined by domain-specific similarity functions over several attributes. In this paper, a novel approach is proposed that uses two-level matching based on double embedding. First, records are embedded into a metric space of dimension K; then they are embedded into a space of smaller dimension K'. The first matching phase operates on the K'-dimensional vectors, performing a quick-and-dirty comparison that prunes a large number of true negatives while ensuring high recall. A more accurate matching phase is then performed on the surviving pairs in the K-dimensional space. Experiments on real data sets showed a gain in time performance ranging from 30% to 60%, while achieving the same level of recall and accuracy as previous single embedding schemes.
Keywords: data cleaning; similarity matching; record linkage; embedding schemes
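The two-level matching described in the abstract can be illustrated with a minimal sketch. This is not the paper's method: the abstract does not say which embeddings or distance are used, so the hashed-bigram embedding `embed`, the folding step `shrink`, the L1 distance, and the dimensions `K` and `K_SMALL` (standing in for the smaller dimension) are all assumptions chosen for illustration. Folding by summation never increases L1 distance, so in this sketch the cheap first phase cannot discard a pair that the second phase would accept.

```python
import zlib
from itertools import product

K = 16        # assumed dimension of the first embedding
K_SMALL = 4   # assumed dimension of the second, smaller embedding

def embed(record, k=K):
    """Hypothetical first embedding: hashed character-bigram counts."""
    text = record.lower()
    vec = [0] * k
    for i in range(len(text) - 1):
        bucket = zlib.crc32(text[i:i + 2].encode("utf-8")) % k
        vec[bucket] += 1
    return vec

def shrink(vec, k_small=K_SMALL):
    """Hypothetical second embedding: sum coordinates whose index is equal mod k_small."""
    out = [0] * k_small
    for i, value in enumerate(vec):
        out[i % k_small] += value
    return out

def l1(a, b):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def link(source_a, source_b, threshold):
    """Return record pairs whose K-dimensional L1 distance is <= threshold."""
    emb_a = [embed(r) for r in source_a]
    emb_b = [embed(r) for r in source_b]
    small_a = [shrink(v) for v in emb_a]
    small_b = [shrink(v) for v in emb_b]
    matches = []
    for i, j in product(range(len(source_a)), range(len(source_b))):
        # Phase 1: cheap comparison in the small space prunes most non-matches.
        if l1(small_a[i], small_b[j]) > threshold:
            continue
        # Phase 2: more accurate comparison in the K-dimensional space.
        if l1(emb_a[i], emb_b[j]) <= threshold:
            matches.append((source_a[i], source_b[j]))
    return matches
```

A call such as `link(["John Smith"], ["john smith", "Mary Jones"], 2)` compares every pair in the small space first and only computes the K-dimensional distance for pairs that survive. Any time gain depends on how many pairs the first phase prunes and on how much cheaper the small-space comparison is.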
Noha Adly
Added 17 Feb 2011
Updated 17 Feb 2011
Type Journal
Year 2009
Where DMIN
Authors Noha Adly