On Finding Similar Items in a Stream of Transactions

While there has been a lot of work on finding frequent itemsets in transaction data streams, none of it solves the problem of finding similar pairs according to standard similarity measures. This paper is a first attempt at dealing with this, arguably more important, problem. We start out with a negative result that also explains the lack of theoretical upper bounds on the space usage of data mining algorithms for finding frequent itemsets: any algorithm that (even only approximately and with a chance of error) finds the most frequent k-itemset must use space $\Omega(\min\{mb, n^k, (mb/\varphi)^k\})$ bits, where $mb$ is the number of items in the stream so far, $n$ is the number of distinct items and $\varphi$ is a support threshold. To achieve any non-trivial space upper bound we must thus abandon a worst-case assumption on the data stream. We work under the model that the transactions come in random order, and show that, surprisingly, not only is small-space similarity mining possible for the most common similari...
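To make the problem statement concrete, here is a minimal sketch of the naive exact baseline: count the support of every item and of every co-occurring pair, then derive pair similarities from those counts. It is not the paper's algorithm, and it deliberately ignores the space constraint the abstract is concerned with, since it stores a counter for every pair it sees. The function name `exact_pair_similarities`, the toy stream, and the choice of Jaccard and cosine as example similarity measures are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def exact_pair_similarities(transactions):
    """Exact (non-streaming-friendly) baseline, for illustration only.

    Returns {(a, b): (jaccard, cosine)} for every pair of items that
    co-occurs in at least one transaction. Space grows with the number
    of distinct co-occurring pairs, which is exactly what a small-space
    streaming algorithm would need to avoid.
    """
    support = Counter()       # item -> number of transactions containing it
    pair_support = Counter()  # (a, b) with a < b -> number of co-occurrences

    for transaction in transactions:
        items = sorted(set(transaction))  # deduplicate, fix pair ordering
        support.update(items)
        pair_support.update(combinations(items, 2))

    similarities = {}
    for (a, b), both in pair_support.items():
        union = support[a] + support[b] - both
        jaccard = both / union
        cosine = both / (support[a] * support[b]) ** 0.5
        similarities[(a, b)] = (jaccard, cosine)
    return similarities


# Hypothetical toy stream of four transactions.
stream = [
    ["milk", "bread"],
    ["milk", "bread", "eggs"],
    ["milk"],
    ["eggs", "bread"],
]
sims = exact_pair_similarities(stream)
for pair, (jaccard, cosine) in sorted(sims.items()):
    print(pair, round(jaccard, 3), round(cosine, 3))
```

On a real stream the `pair_support` table can hold up to one counter per pair of distinct items, which is why the abstract turns to the random-order model to obtain small-space guarantees.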

Andrea Campagna, Rasmus Pagh

Data Mining | Frequent Itemsets | ICDM 2010 | Support Threshold | Upper Bound

Post Info

Added	12 Feb 2011
Updated	12 Feb 2011
Type	Journal
Year	2010
Where	ICDM
Authors	Andrea Campagna, Rasmus Pagh
