This paper investigates the use of supervised clustering to create sets of categories for the classification of documents. We use information from a pre-existing taxonomy to supervise the creation of a set of related clusters, while retaining some freedom in defining and creating the classes. We show that the advantage of supervised clustering is that it gives some control over the range of subjects the categorization system addresses, while providing a precise mathematical definition of each category. We then categorize documents using this a priori knowledge of the definition of each category. We also discuss a new technique that helps the classifier distinguish better among closely related clusters. Finally, we show empirically that this categorization system, which uses a machine-derived taxonomy, performs as well as a manual categorization process, but at a far lower cost.
Charu C. Aggarwal, Stephen C. Gates, Philip S. Yu