Sciweavers

NIPS
2003

Learning the k in k-means

14 years 2 months ago
Learning the k in k-means
When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, ...
Greg Hamerly, Charles Elkan
Added 31 Oct 2010
Updated 31 Oct 2010
Type Conference
Year 2003
Where NIPS
Authors Greg Hamerly, Charles Elkan
Comments (0)