Aim of this work is to apply a novel comprehensive machine learning tool for data mining to preprocessing and interpretation of gene expression data. Furthermore, some visualization facilities are provided. The data mining framework consists of two main parts: preprocessing and clustering-agglomerating phases. To the first phase belong a noise filtering procedure and a non-linear PCA Neural Network for feature extraction. The second phase is used to accomplish an unsupervised clustering based on a hierarchy of two approaches: a Probabilistic Principal Surfaces to obtain the rough regions of interesting points and a FisherNegentropy information based approach to agglomerate the regions previously found in order to discover substructures present in the data. Experiments on gene microarray data are made. Several experiments are shown varying the threshold, needed by the agglomerative clustering, to understand the structure of the analyzed data set.
Roberto Amato, Angelo Ciaramella, N. Deniskina, Ca