In our previous publication (Wang 2002), we introduced a novel method that uses a hierarchical domain ontology to extract features for representing documents. All raw words in the training documents are mapped to concepts drawn from the domain ontology. From these concepts, a concept hierarchy is built for the training document space using the is-a relationships defined in the ontology. Searching this hierarchy with an appropriate heuristic function yields an optimum concept set, which can then serve as the feature space for representing the training dataset. The method aims to address some of the drawbacks of existing text classification and feature selection algorithms. In this paper, we report a series of experiments comparing our approach with comparable feature-selection and feature-extraction methods. The results indicate that our approach offers advantages in several respects.
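To make the pipeline concrete, the following is a minimal sketch of the three steps described above: mapping words to concepts, propagating counts up the is-a hierarchy, and searching the hierarchy for a concept set. It is an illustration under stated assumptions, not the implementation evaluated in this paper. The toy ontology `IS_A`, the lexicon `WORD_TO_CONCEPT`, the function names, and in particular the count-threshold heuristic `min_count` are all hypothetical stand-ins; the heuristic function used in our method is not reproduced here.

```python
from collections import Counter

# Hypothetical toy ontology: child concept -> parent concept (is-a links).
IS_A = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "sparrow": "bird",
    "bird": "animal",
}

# Hypothetical lexicon mapping raw words to ontology concepts.
WORD_TO_CONCEPT = {
    "dog": "dog", "puppy": "dog",
    "cat": "cat", "kitten": "cat",
    "sparrow": "sparrow",
}


def concept_counts(tokens):
    """Map each word to its concept and credit that concept and all its ancestors."""
    counts = Counter()
    for tok in tokens:
        concept = WORD_TO_CONCEPT.get(tok)  # words outside the lexicon are skipped
        while concept is not None:
            counts[concept] += 1
            concept = IS_A.get(concept)     # climb one is-a link; None at a root
    return counts


def select_concepts(counts, min_count):
    """Top-down search with an assumed heuristic: descend while some child is
    still supported by at least `min_count` word occurrences, otherwise keep
    the current concept. Children below the threshold are dropped."""
    children = {}
    for child, parent in IS_A.items():
        children.setdefault(parent, []).append(child)

    def descend(node):
        strong = [c for c in children.get(node, []) if counts[c] >= min_count]
        if not strong:
            return [node]
        selected = []
        for c in strong:
            selected.extend(descend(c))
        return selected

    roots = [c for c in counts if c not in IS_A and counts[c] >= min_count]
    result = []
    for root in roots:
        result.extend(descend(root))
    return sorted(result)


def featurize(tokens, features):
    """Represent a document as counts over the selected concept set."""
    counts = concept_counts(tokens)
    return [counts[c] for c in features]


if __name__ == "__main__":
    training_tokens = ["dog", "puppy", "cat", "kitten", "sparrow", "table"]
    counts = concept_counts(training_tokens)
    features = select_concepts(counts, min_count=3)
    print(features, featurize(["puppy", "dog", "sparrow"], features))
```

In this sketch, raising `min_count` pushes the selected set toward more general concepts, while lowering it favours more specific ones; any other scoring function could be substituted in `descend` without changing the rest of the pipeline.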
Bill B. Wang, Robert I. McKay, Hussein A. Abbass,