Many applications require analyzing vast amounts of textual data, but the size and inherent noise of such data can make processing very challenging. One approach to these issues is to mathematically reduce the data so that each document is represented using only a few dimensions. Techniques for performing such “dimensionality reduction” (DR) have been well studied for geometric and numerical data, but have more rarely been applied to text. In this paper, we examine the impact of five DR techniques on the accuracy of two supervised classifiers on three textual sources. This task mirrors important real-world problems, such as classifying web pages or scientific articles. In addition, classification accuracy serves as a proxy measure for how well each DR technique preserves the inter-document relationships while vastly reducing the size of the data, a reduction that in turn facilitates more sophisticated analysis. We show that, for a fixed number of dimensions, DR can be very successful at improving accuracy compared to using t...
David G. Underhill, Luke McDowell, David J. Marche