Efficient Record Linkage using a Double Embedding Scheme



Record linkage is the problem of identifying similar records across different data sources. The similarity between two records is defined by domain-specific similarity functions over several attributes. This paper proposes a novel approach that uses two-level matching based on double embedding. Records are first embedded into a metric space of dimension K, and then embedded into a space of smaller dimension K'. The first matching phase operates on the K'-dimensional vectors, performing a quick-and-dirty comparison that prunes a large number of true negatives while ensuring high recall. A more accurate matching phase is then performed on the surviving candidate pairs in the K-dimensional space. Experiments on real data sets show a gain in time performance ranging from 30% to 60% while achieving the same level of recall and accuracy as previous single-embedding schemes.

Keywords: data cleaning; similarity matching; record linkage; embedding schemes
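The two-phase idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which embedding functions or distance the paper uses, so everything concrete here is an assumption made for illustration: hashed character-bigram counts stand in for the first embedding (`embed_k`), coordinate folding stands in for the second, smaller embedding (`embed_small`), L1 is the distance, and `K`, `K_SMALL`, and `threshold` are arbitrary values. Folding is chosen because the L1 distance between folded vectors can never exceed the L1 distance between the originals, so the cheap first phase never prunes a pair that the accurate second phase would accept.

```python
K = 64        # dimension of the first embedding (assumed value)
K_SMALL = 8   # dimension of the second, smaller embedding (assumed value)


def embed_k(text, k=K):
    """Stand-in first embedding: hashed character-bigram counts in k dimensions."""
    vec = [0] * k
    s = text.lower()
    for i in range(len(s) - 1):
        bigram = s[i:i + 2]
        # Deterministic bucket index (the built-in hash() is salted per process).
        bucket = (ord(bigram[0]) * 31 + ord(bigram[1])) % k
        vec[bucket] += 1
    return vec


def embed_small(vec, k_small=K_SMALL):
    """Stand-in second embedding: fold coordinates i -> i % k_small by summing."""
    out = [0] * k_small
    for i, v in enumerate(vec):
        out[i % k_small] += v
    return out


def l1(a, b):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def link(source_a, source_b, threshold):
    """Two-phase matching: cheap filter on small vectors, then exact check."""
    vecs_a = [embed_k(r) for r in source_a]
    vecs_b = [embed_k(r) for r in source_b]
    small_a = [embed_small(v) for v in vecs_a]
    small_b = [embed_small(v) for v in vecs_b]

    # Phase 1: quick comparison in the smaller space prunes most non-matches.
    candidates = [
        (i, j)
        for i in range(len(source_a))
        for j in range(len(source_b))
        if l1(small_a[i], small_b[j]) <= threshold
    ]
    # Phase 2: accurate comparison in the K-dimensional space on survivors only.
    matches = [(i, j) for i, j in candidates if l1(vecs_a[i], vecs_b[j]) <= threshold]
    return candidates, matches


if __name__ == "__main__":
    a = ["John Smith", "Mary Jones"]
    b = ["Jon Smith", "Alice Brown"]
    candidates, matches = link(a, b, threshold=4)
    print("candidates:", candidates)
    print("matches:", matches)
```

In this sketch the first phase still compares every pair, only more cheaply; a real system would likely combine the small vectors with an index or blocking step, which the abstract does not describe.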

Noha Adly
