Efficient top-k count queries over imprecise duplicates

We propose efficient techniques for processing various TopK count queries on data with noisy duplicates. Our method differs from existing work on duplicate elimination in two significant ways. First, we deduplicate on the fly only the part of the data needed for the answer, a requirement in massive and evolving sources where batch deduplication is expensive. The non-local nature of the problem of partitioning data into duplicate groups makes it challenging to filter only those tuples forming the K largest groups. We propose a novel method of successively collapsing and pruning records, which yields an order of magnitude reduction in running time compared to deduplicating the entire data first. Second, we return multiple high-scoring answers to handle situations where it is impossible to resolve whether two records are indeed duplicates of each other. Since finding even the highest-scoring deduplication is NP-hard, the existing approach is to deploy one of many variants of score-based clustering...
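
The collapse-then-prune idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm, only one plausible reading of a single round under stated assumptions: `key` is a hypothetical function under which equal values are assumed to prove two records are duplicates, and `may_match` is a hypothetical test that is assumed to be true for every genuine duplicate pair (so a false result proves two records are distinct). The function names `collapse` and `prune` are likewise illustrative.

```python
from collections import defaultdict


def collapse(records, key):
    """Merge records whose `key` values are equal.

    Assumption: equal keys are sufficient evidence that records are duplicates.
    """
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    return list(groups.values())


def prune(groups, may_match, k):
    """Drop collapsed groups that cannot belong to any of the k largest final groups.

    Assumption: may_match(a, b) is True whenever a and b are real duplicates,
    and k >= 1.
    """
    # Lower bound: greedily pick k groups, largest first, that provably lie in
    # k different final groups (no pair of them can still be merged).
    distinct = []
    for g in sorted(groups, key=len, reverse=True):
        if all(not may_match(g[0], d[0]) for d in distinct):
            distinct.append(g)
        if len(distinct) == k:
            break
    if len(distinct) < k:
        return groups  # no safe bound available, so prune nothing
    threshold = len(distinct[-1])

    kept = []
    for g in groups:
        # Upper bound: this group plus every group it might still merge with.
        upper = sum(len(h) for h in groups if h is g or may_match(g[0], h[0]))
        if upper >= threshold:
            kept.append(g)
    return kept


records = ["Ann Lee", "ann lee", "ANN LEE", "A. Lee",
           "Bob Ray", "bob ray", "B. Ray", "Cy Orr", "Di Fox"]
groups = collapse(records, key=str.lower)
kept = prune(groups, may_match=lambda a, b: a[0].lower() == b[0].lower(), k=2)
```

Only the groups in `kept` would then need the expensive, fine-grained deduplication; in this toy input the singleton groups for "Cy Orr" and "Di Fox" are discarded because even their most optimistic size cannot reach the bound set by the two largest provably distinct groups.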

Sunita Sarawagi, Vinay S. Deshpande, Sourabh Kasliwal

Batch Deduplication | Database | EDBT 2009 | Scoring Answers | TopK Count Queries |

Added	16 Aug 2010
Updated	16 Aug 2010
Type	Conference
Year	2009
Where	EDBT
Authors	Sunita Sarawagi, Vinay S. Deshpande, Sourabh Kasliwal
