Many information integration tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of instance pairs grows quadratically with the size of the input dataset, computing similarity between all pairs becomes prohibitive for large datasets and complex similarity functions, which prevents record linkage from scaling. Blocking methods alleviate this problem by efficiently selecting a subset of object pairs for which similarity is computed and treating the remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing a similarity function or a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for training blocking functions to be efficient and accura...
Mikhail Bilenko, Beena Kamath, Raymond J. Mooney
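
To make the contrast between all-pairs comparison and blocking concrete, the following is a minimal sketch in Python. It is not the adaptive framework the abstract introduces: it uses a single hand-picked blocking predicate, which is the kind of manual construction the abstract attributes to earlier methods. The function names (`all_pairs`, `blocked_pairs`, `surname_prefix`), the `surname` field, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(records):
    """Every unordered pair of record ids: n * (n - 1) / 2 candidates."""
    return set(combinations(sorted(records), 2))


def blocked_pairs(records, blocking_key):
    """Candidate pairs restricted to records that share a blocking key."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        # Pairs are generated only inside each block; records in
        # different blocks are treated as dissimilar and never compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def surname_prefix(rec):
    # Hypothetical hand-picked predicate: first three letters of the surname.
    return rec["surname"][:3].lower()


# Toy data, invented for this sketch.
records = {
    1: {"surname": "Johnson"},
    2: {"surname": "Johnsen"},
    3: {"surname": "Miller"},
    4: {"surname": "Millar"},
    5: {"surname": "Garcia"},
}

candidates = blocked_pairs(records, surname_prefix)
```

On these five records, the all-pairs approach yields 10 candidate pairs, while the blocked version keeps only the two pairs whose surnames share a prefix. The similarity function would then be evaluated on that reduced set only. A poorly chosen key would miss true matches, which is why the choice of blocking function matters.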