ICDM
2003
IEEE

Scalable Model-based Clustering by Working on Data Summaries

The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources. In this paper, we present a two-phase scalable model-based clustering framework: first, a large data set is summarized into sub-clusters; then, clusters are generated directly from the summary statistics of the sub-clusters by a specifically designed Expectation-Maximization (EM) algorithm. Taking Gaussian mixture models as an example, we establish a provably convergent EM algorithm, EMADS, which explicitly incorporates the cardinality, mean, and covariance information of each sub-cluster. Combined with different data summarization procedures, EMADS is used to construct two clustering systems: gEMADS and bEMADS. The experimental results demonstrate that they run several orders of magnitude faster than the classic EM algorithm with little loss of accuracy. They generate significantly better results than other model-based clustering systems using similar compu...
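
To make the second phase concrete, the following is a minimal sketch of an EM-style fit of a Gaussian mixture that sees only per-sub-cluster summaries (cardinality, mean, covariance) and never the raw points. It is not the authors' EMADS: the function name `em_on_summaries`, the regularization term `reg`, the initialization, and the E-step weighting (the average log-density of a sub-cluster's points, which adds a trace term to the usual Gaussian log-density) are all assumptions made for illustration.

```python
import numpy as np

def em_on_summaries(n, mu, S, K, iters=50, seed=0, reg=1e-6):
    """Fit a K-component Gaussian mixture from sub-cluster summaries.

    n  : (M,)      number of points in each sub-cluster
    mu : (M, D)    mean of each sub-cluster
    S  : (M, D, D) biased covariance of each sub-cluster
    Hypothetical helper, not the published EMADS algorithm.
    """
    n = np.asarray(n, dtype=float)
    mu = np.asarray(mu, dtype=float)
    S = np.asarray(S, dtype=float)
    M, D = mu.shape
    N = n.sum()
    rng = np.random.default_rng(seed)

    # Initialization (an assumption): random sub-cluster means, pooled covariance.
    means = mu[rng.choice(M, size=K, replace=False)].copy()
    d = mu - (n @ mu) / N
    pooled = np.einsum("m,mij->ij", n, S) / N + (d.T * n) @ d / N
    covs = np.tile(pooled + reg * np.eye(D), (K, 1, 1))
    weights = np.full(K, 1.0 / K)

    for _ in range(iters):
        # E-step: soft membership of each sub-cluster in each component,
        # based on the average log-density of the sub-cluster's points.
        log_r = np.empty((M, K))
        for k in range(K):
            inv = np.linalg.inv(covs[k])
            _, logdet = np.linalg.slogdet(covs[k])
            diff = mu - means[k]
            maha = np.einsum("mi,ij,mj->m", diff, inv, diff)
            trace = np.einsum("ij,mji->m", inv, S)  # tr(inv @ S_m)
            log_r[:, k] = np.log(weights[k]) - 0.5 * (
                D * np.log(2.0 * np.pi) + logdet + maha + trace
            )
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: every sub-cluster contributes n_m points' worth of
        # first- and second-order statistics.
        w = n[:, None] * r
        Nk = w.sum(axis=0)
        weights = Nk / N
        means = (w.T @ mu) / Nk[:, None]
        for k in range(K):
            diff = mu - means[k]
            scatter = S + diff[:, :, None] * diff[:, None, :]
            covs[k] = np.einsum("m,mij->ij", w[:, k], scatter) / Nk[k]
            covs[k] += reg * np.eye(D)

    return weights, means, covs
```

Each iteration costs time proportional to the number of sub-clusters rather than the number of data points, which is where a summary-based approach gets its speed. With `K=1` the sketch reproduces the mean and biased covariance of the full data set up to `reg`, a quick check that no second-order information is lost in the summaries.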
Huidong Jin, Man Leung Wong, Kwong-Sak Leung
Added 04 Jul 2010
Updated 04 Jul 2010
Type Conference
Year 2003
Where ICDM