PAKDD
2015
ACM

Leveraging the Common Cause of Errors for Constraint-Based Data Cleansing

This study describes a statistically motivated approach to constraint-based data cleansing that derives the cause of errors from the distribution of conflicting tuples. In real-world dirty data, errors are often not randomly distributed; they tend to occur only under certain conditions, such as when a transaction is handled by a particular operator or when the weather is rainy. By leveraging such common conditions, or “cause conditions”, the algorithm resolves multi-tuple conflicts quickly and with high accuracy in realistic settings where the distribution of errors is skewed. We present complexity analyses of the problem and identify two subproblems that are NP-complete. For each subproblem, we then introduce heuristics that work in sub-polynomial time. The algorithms are tested on three sets of data and rules. The experiments show that, compared to the state-of-the-art methods for Conditional Functional Dependencies (CFD)-based and FD-based data cleansing, the propos...
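To make the idea of a "cause condition" concrete, the following is a minimal toy sketch and not the authors' algorithm. It assumes a single functional dependency `zip -> city`, a hypothetical `operator` attribute, and a simple lift score, P(conflict | condition) / P(conflict), chosen here only for illustration. It finds the tuples that take part in an FD violation and ranks attribute-value conditions by how over-represented they are among those tuples.

```python
from collections import defaultdict

def conflicting_tuples(rows, lhs, rhs):
    """Return indices of rows involved in a violation of the FD lhs -> rhs."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in lhs)].append(i)
    conflicts = set()
    for idxs in groups.values():
        # A group violates the FD if equal lhs values map to different rhs values.
        if len({rows[i][rhs] for i in idxs}) > 1:
            conflicts.update(idxs)
    return conflicts

def rank_cause_conditions(rows, conflicts, candidate_attrs):
    """Rank (attribute, value) conditions by lift over the base conflict rate.

    The lift score is an illustrative assumption, not the paper's statistic.
    """
    if not rows or not conflicts:
        return []
    base = len(conflicts) / len(rows)
    total = defaultdict(int)
    bad = defaultdict(int)
    for i, row in enumerate(rows):
        for attr in candidate_attrs:
            key = (attr, row[attr])
            total[key] += 1
            if i in conflicts:
                bad[key] += 1
    scored = [((bad[k] / total[k]) / base, k) for k in total]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical toy data: operator "B" is involved in every conflict.
rows = [
    {"zip": "100", "city": "Tokyo",  "operator": "A"},
    {"zip": "100", "city": "Tokyp",  "operator": "B"},
    {"zip": "200", "city": "Osaka",  "operator": "A"},
    {"zip": "200", "city": "Osaka",  "operator": "A"},
    {"zip": "300", "city": "Kyoto",  "operator": "A"},
    {"zip": "300", "city": "Kyoto",  "operator": "A"},
    {"zip": "400", "city": "Nagoya", "operator": "B"},
    {"zip": "400", "city": "Nagya",  "operator": "B"},
]

conflicts = conflicting_tuples(rows, lhs=["zip"], rhs="city")
ranked = rank_cause_conditions(rows, conflicts, candidate_attrs=["operator"])
print(ranked[0])  # highest-lift condition for this toy data
```

In this toy data the condition `operator = "B"` ranks first, which matches the intuition in the abstract that errors cluster under particular conditions. A real system would also need to decide which tuple in each conflicting group to repair, which this sketch does not attempt.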
Added 16 Apr 2016
Updated 16 Apr 2016
Type Journal
Year 2015
Where PAKDD
Authors Ayako Hoshino, Hiroki Nakayama, Chihiro Ito, Kyota Kanno, Kenshi Nishimura