Sciweavers

KDD
2005
ACM

On the use of linear programming for unsupervised text classification

We propose a new algorithm for dimensionality reduction and unsupervised text classification. We use mixture models as the underlying process that generates the corpus and utilize a novel, L1-norm based approach introduced by Kleinberg and Sandler [19]. We show that our algorithm performs extremely well on large datasets, with peak accuracy approaching that of supervised learning based on Support Vector Machines with large training sets. The method is based on the same idea that underlies Latent Semantic Indexing (LSI): we find a good low-dimensional subspace of the feature space and project all documents into it. However, our projection minimizes a different error, and unlike LSI we build a basis that in many cases corresponds to the actual topics. We present the results of testing our algorithm on the abstracts of arXiv, an electronic repository of scientific papers, and on the 20 Newsgroup dataset, a small snapshot of 20 specific newsgroups. Categories and Subject Descriptors H.3.3 [Information Storage a...
Mark Sandler
Added 30 Nov 2009
Updated 30 Nov 2009
Type Conference
Year 2005
Where KDD
Authors Mark Sandler