Researchers in data mining frequently have to spend a significant portion of their time preprocessing data before they can apply their algorithms to real-world datasets. Many real-world datasets are imperfect: they contain missing, erroneous, and duplicate data, among other problems. It is well established that, in general, if such problems are not corrected, applying data mining algorithms can lead to wrong results (the “garbage in, garbage out” principle). Therefore, data cleaning techniques should be applied to the data in advance to ensure high-quality results. In this paper we address a data cleaning challenge called object consolidation. This challenge arises because objects in datasets are often represented via descriptions (sets of instantiated attributes) that alone might not uniquely identify the object. The goal of object consolidation is to correctly consolidate (i.e., to group/determine) all the representations of the same object, for ...
Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotr