Set Similarity Join on Probabilistic Data

15 years 5 months ago

Set similarity join plays an important role in many real-world applications, such as data cleaning, near-duplicate detection, and data integration. In these applications, set data often contain noise and are thus uncertain and imprecise. In this paper, we model such probabilistic set data on two uncertainty levels: the set level and the element level. Based on these models, we investigate the problem of probabilistic set similarity join (PS²J) over two probabilistic set databases under possible worlds semantics. To process the PS²J operator efficiently, we first reduce the problem by condensing the possible worlds, and then propose effective pruning techniques, including Jaccard distance pruning, probability upper bound pruning, and aggregate pruning, which filter out false alarms among probabilistic set pairs with the help of indexes and our designed synopses. Extensive experiments on both real and synthetic data demonstrate the PS²J processing performance.
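To make the possible worlds semantics concrete, the following is a minimal brute-force sketch in Python, not the paper's algorithm. It assumes an element-level model in which each element of a set exists independently with a given probability, and it assumes the join returns pairs whose probability of reaching a Jaccard similarity threshold `gamma` is at least `alpha`. The function names, the dictionary representation, and the convention that two empty sets have similarity 1 are all illustrative assumptions. The enumeration is exponential in the set size, which is the cost that condensing the possible worlds and pruning are meant to avoid.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical (assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def possible_worlds(pset):
    """Yield (world, probability) pairs for an element-level uncertain set.

    pset maps each element to its existence probability; elements are
    assumed independent (an assumption of this sketch).
    """
    items = list(pset.items())
    for mask in product([False, True], repeat=len(items)):
        world = frozenset(e for (e, _), keep in zip(items, mask) if keep)
        prob = 1.0
        for (_, p), keep in zip(items, mask):
            prob *= p if keep else 1.0 - p
        yield world, prob


def join_probability(r, s, gamma):
    """Probability that jaccard(r, s) >= gamma over all pairs of possible worlds."""
    total = 0.0
    for world_r, prob_r in possible_worlds(r):
        for world_s, prob_s in possible_worlds(s):
            if jaccard(world_r, world_s) >= gamma:
                total += prob_r * prob_s
    return total


def ps2j_bruteforce(R, S, gamma, alpha):
    """Return ids (i, j) of pairs whose join probability is at least alpha."""
    return [
        (i, j)
        for i, r in R.items()
        for j, s in S.items()
        if join_probability(r, s, gamma) >= alpha
    ]


if __name__ == "__main__":
    R = {"r1": {"a": 1.0, "b": 0.5}}
    S = {"s1": {"a": 1.0, "b": 0.5}, "s2": {"c": 1.0}}
    print(ps2j_bruteforce(R, S, gamma=0.6, alpha=0.5))
```

In this toy input, `r1` and `s1` agree only in the worlds where element `b` is present in both or absent from both, so their join probability is 0.25 + 0.25 = 0.5, while `r1` and `s2` never share an element.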

Xiang Lian, Lei Chen 0002
