Sciweavers

EDBT
2009
ACM

Efficient top-k count queries over imprecise duplicates

We propose efficient techniques for processing various top-k count queries on data with noisy duplicates. Our method differs from existing work on duplicate elimination in two significant ways. First, we deduplicate on the fly only the part of the data needed for the answer, a requirement in massive and evolving sources where batch deduplication is expensive. The non-local nature of the problem of partitioning data into duplicate groups makes it challenging to filter only those tuples forming the K largest groups. We propose a novel method of successively collapsing and pruning records, which yields an order of magnitude reduction in running time compared to deduplicating the entire data first. Second, we return multiple high-scoring answers to handle situations where it is impossible to resolve whether two records are indeed duplicates of each other. Since finding even the highest-scoring deduplication is NP-hard, the existing approach is to deploy one of many variants of score-based clustering...
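As a rough illustration of the collapse-then-prune idea mentioned in the abstract, here is a minimal Python sketch. It is a simplification written for this summary, not the authors' algorithm. It assumes two hypothetical helpers that the abstract does not define: `sufficient_key` (records with equal keys are treated as certain duplicates) and `block_of` (true duplicates are assumed never to span two blocks). Under those assumptions, a group can be discarded when even its whole block is smaller than a lower bound on the K-th largest group.

```python
from collections import Counter, defaultdict
import heapq
import re


def sufficient_key(record: str) -> str:
    """Hypothetical 'certain duplicate' test: equal keys are merged."""
    return re.sub(r"[^a-z0-9]", "", record.lower())


def block_of(key: str) -> str:
    """Hypothetical necessary condition: duplicates share the first character."""
    return key[:1]


def collapse_and_prune(records, k):
    """Return collapsed groups that could still belong to the k largest groups."""
    if k < 1:
        raise ValueError("k must be at least 1")

    # Collapse: merge records that are certainly duplicates of each other.
    sizes = Counter(sufficient_key(r) for r in records)

    # block_total: upper bound on the size of any true group inside the block.
    # block_max: size of one group per block; groups from different blocks
    # are guaranteed to be distinct under the blocking assumption.
    block_total = defaultdict(int)
    block_max = defaultdict(int)
    for key, n in sizes.items():
        b = block_of(key)
        block_total[b] += n
        block_max[b] = max(block_max[b], n)

    # Lower bound on the k-th largest true group: the k-th largest of the
    # per-block maxima. With fewer than k blocks, no pruning is possible.
    top = heapq.nlargest(k, block_max.values())
    threshold = top[-1] if len(top) == k else 0

    # Prune: drop groups whose upper bound falls strictly below the threshold.
    return {
        key: n for key, n in sizes.items()
        if block_total[block_of(key)] >= threshold
    }
```

Only the surviving groups would then need the expensive pairwise duplicate resolution. A single partitioning block key keeps the bounds easy to verify; overlapping similarity predicates would need a more careful bound than this sketch provides.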
Sunita Sarawagi, Vinay S. Deshpande, Sourabh Kasliwal
Added 16 Aug 2010
Updated 16 Aug 2010
Type Conference
Year 2009
Where EDBT
Authors Sunita Sarawagi, Vinay S. Deshpande, Sourabh Kasliwal