ICDE
2010
IEEE

ProbClean: A probabilistic duplicate detection system

One of the most prominent data quality problems is the existence of duplicate records. Current data cleaning systems usually produce a single clean instance (repair) of the input data by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases perfect settings do not exist. We propose ProbClean, a system that treats duplicate detection procedures as data processing tasks with uncertain outcomes. We use a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. ProbClean efficiently supports relational queries and allows new types of queries against a set of possible repairs.
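The idea of a space of possible repairs indexed by parameter settings can be illustrated with a small sketch. This is not ProbClean's uncertainty model or its query engine, which the abstract does not specify. It assumes a toy setup: three hypothetical records with hand-picked pairwise distances, single-linkage clustering as the duplicate detection algorithm, one parameter (a distance threshold), and a uniform weight over a handful of candidate thresholds. All names (`cluster`, `possible_repairs`, `duplicate_probability`) are made up for illustration.

```python
from itertools import combinations


def cluster(ids, dist, threshold):
    """Single-linkage clustering: records at distance <= threshold are merged."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative, halving paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(ids, 2):
        if dist[frozenset((a, b))] <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), set()).add(i)
    return frozenset(frozenset(g) for g in groups.values())


def possible_repairs(ids, dist, thresholds):
    """One clustering (one 'repair') per candidate parameter setting."""
    return {t: cluster(ids, dist, t) for t in thresholds}


def duplicate_probability(repairs, a, b):
    """Fraction of repairs in which a and b share a cluster.

    Assumes every parameter setting is equally likely (an assumption of
    this sketch, not a claim about the paper).
    """
    hits = sum(
        any(a in group and b in group for group in repair)
        for repair in repairs.values()
    )
    return hits / len(repairs)


# Hypothetical records and distances, chosen by hand for the example.
ids = ["r1", "r2", "r3"]
dist = {
    frozenset(("r1", "r2")): 0.1,
    frozenset(("r2", "r3")): 0.4,
    frozenset(("r1", "r3")): 0.6,
}
thresholds = [0.05, 0.2, 0.5, 0.7]

repairs = possible_repairs(ids, dist, thresholds)
print(duplicate_probability(repairs, "r1", "r2"))  # 0.75
print(duplicate_probability(repairs, "r1", "r3"))  # 0.5
```

Instead of committing to one threshold, the sketch keeps every clustering and answers a query ("are r1 and r2 duplicates?") with a probability over them. Materializing each repair separately, as done here, is the naive approach; the abstract's point is that the system encodes this space compactly.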
George Beskales, Mohamed A. Soliman, Ihab F. Ilyas
Added 17 May 2010
Updated 17 May 2010
Type Conference
Year 2010
Where ICDE
Authors George Beskales, Mohamed A. Soliman, Ihab F. Ilyas, Shai Ben-David, Yubin Kim