CACTUS - Clustering Categorical Data Using Summaries

14 years 11 months ago

Clustering is an important data mining problem. Most of the earlier work on clustering focused on numeric attributes which have a natural ordering on their attribute values. Recently, clustering data with categorical attributes, whose attribute values do not have a natural ordering, has received some attention. However, previous algorithms do not give a formal description of the clusters they discover and some of them assume that the user post-processes the output of the algorithm to identify the final clusters. In this paper, we introduce a novel formalization of a cluster for categorical attributes by generalizing a definition of a cluster for numerical attributes. We then describe a very fast summarization-based algorithm called CACTUS that discovers exactly such clusters in the data. CACTUS has two important characteristics. First, the algorithm requires only two scans of the dataset, and hence is very fast and scalable. Our experiments on a variety of datasets show that CACTUS ...
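The abstract does not say what the summaries contain, so the following is only a minimal sketch of one plausible form of a single-scan summary for categorical data, not the paper's algorithm. It assumes that a summary records how often pairs of attribute values co-occur, and that a pair counts as "strongly connected" when its count exceeds a multiple of the count expected if values were uniform and independent. The function name `inter_attribute_summary`, the parameter `alpha`, and the sample rows are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def inter_attribute_summary(rows, domain_sizes, alpha=2.0):
    """Scan the rows once and return strongly co-occurring value pairs.

    rows         -- iterable of tuples, one categorical value per attribute
    domain_sizes -- number of distinct values of each attribute (assumed known)
    alpha        -- hypothetical threshold factor over the expected count
    """
    pair_counts = Counter()
    n = 0
    for row in rows:  # the only pass over the data
        n += 1
        # Count every pair of values that belong to two different attributes.
        for (i, a), (j, b) in combinations(enumerate(row), 2):
            pair_counts[(i, a, j, b)] += 1

    strong = {}
    for (i, a, j, b), count in pair_counts.items():
        # Expected co-occurrence count if both attributes were uniformly
        # distributed and independent of each other (an assumption).
        expected = n / (domain_sizes[i] * domain_sizes[j])
        if count > alpha * expected:
            strong[(i, a, j, b)] = count
    return strong


# Hypothetical data: two attributes (colour, size), two values each.
rows = [
    ("red", "small"), ("red", "small"), ("red", "small"),
    ("blue", "large"), ("blue", "large"), ("blue", "large"),
    ("red", "large"), ("blue", "small"),
]
strong_pairs = inter_attribute_summary(rows, domain_sizes=[2, 2], alpha=1.2)
```

Because the counts are bounded by the number of distinct value pairs rather than by the number of rows, a summary like this can typically be held in memory, which is what makes a small, fixed number of scans feasible.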

Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishn
