Kernel functions can be viewed as a non-linear transformation that increases the separability of the input data by mapping them to a new high dimensional space. The incorporation of kernel function enables the K-Means algorithm to explore the inherent data pattern in the new space. However, the recent applications of kernel KMeans algorithm are confined to small corpora due to its expensive computation and storage cost. To overcome these obstacles, we propose a new clustering scheme which changes the clustering order from the sequence of samples to the sequence of kernels, and employs a diskbased strategy to control data. The new clustering scheme has been demonstrated to be very efficient for large corpus by our experiments on hand-written digits recognition, in which more than 90% of the running time was saved.
Rong Zhang, Alexander I. Rudnicky