ProbClean: A probabilistic duplicate detection system



One of the most prominent data quality problems is the existence of duplicate records. Current data cleaning systems usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. We propose ProbClean, a system that treats duplicate detection procedures as data processing tasks with uncertain outcomes. We use a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. ProbClean efficiently supports relational queries and allows new types of queries against a set of possible repairs.
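To make the idea of "possible repairs corresponding to different parameter settings" concrete, the following is a minimal sketch and not ProbClean's actual model or API. It assumes a single parameter (a similarity threshold), a character-bigram Jaccard similarity, and single-linkage clustering; all function names (`similarity`, `repair`, `possible_repairs`, `cluster_support`) are hypothetical and chosen for illustration only.

```python
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Jaccard similarity over character bigrams (an illustrative choice)."""
    def grams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    ga, gb = grams(a), grams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0


def repair(records, threshold):
    """One repair: single-linkage clusters of records with similarity >= threshold."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    # A repair is a set of clusters; each cluster is a set of record indices.
    return frozenset(frozenset(c) for c in clusters.values())


def possible_repairs(records, thresholds):
    """Map each distinct repair to the thresholds that produce it."""
    repairs = {}
    for t in thresholds:
        repairs.setdefault(repair(records, t), []).append(t)
    return repairs


def cluster_support(records, thresholds):
    """For each cluster, the fraction of parameter settings in which it appears."""
    counts = {}
    for t in thresholds:
        for cluster in repair(records, t):
            counts[cluster] = counts.get(cluster, 0) + 1
    return {c: n / len(thresholds) for c, n in counts.items()}


if __name__ == "__main__":
    records = ["John Smith", "Jon Smith", "Mary Jones"]
    thresholds = [0.05, 0.5, 0.9]  # assumed settings, not from the paper
    for rep, ts in possible_repairs(records, thresholds).items():
        print(sorted(sorted(c) for c in rep), "from thresholds", ts)
    print(cluster_support(records, thresholds))
```

Instead of committing to one threshold, the sketch keeps every distinct clustering together with the settings that yield it. A query such as "how often are records 0 and 1 treated as duplicates?" can then be answered over the whole set of repairs, which loosely mirrors the kind of query the abstract describes.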

George Beskales, Mohamed A. Soliman, Ihab F. Ilyas


Database | Duplicate Detection | Duplicate Detection Algorithms | ICDE 2010 | Parameter Settings



Post Info

Added	17 May 2010
Updated	17 May 2010
Type	Conference
Year	2010
Where	ICDE
Authors	George Beskales, Mohamed A. Soliman, Ihab F. Ilyas, Shai Ben-David, Yubin Kim
