In this paper we present a method for clustering SAGE (Serial Analysis of Gene Expression) data to detect similarities and dissimilarities between different types of cancer on the subcellular level. The data, however, is extremely high dimensional, and due to the method of measurement, there are many errors as well as missing values in the data, challenging any clustering algorithm. Therefore, we introduce special pre-processing techniques to reduce these errors and to restore missing data. These techniques are tailored to the process that generates the data, making only very conservative changes. Furthermore, we present a new subspace selection technique to identify a relevant subset of attributes (genes) using the Wilcoxon test. This is a general technique that can be applied to select subspaces for the purpose of clustering whenever some high-level categories of interest are known for the data (such as cancerous and noncancerous). Finally, we discuss the results of the application ...
Jörg Sander, Monica C. Sleumer, Raymond T. Ng