ML
2000
ACM

Text Classification from Labeled and Unlabeled Documents using EM

This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However, these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve cl...
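The basic procedure described in the abstract (train on the labeled documents, probabilistically label the unlabeled ones, retrain on everything, repeat) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the multinomial word-count model, the Laplace smoothing constant `alpha`, and the stopping rule based on the change in posteriors are all assumptions made for the example. The two extensions mentioned in the abstract are not covered.

```python
import numpy as np

def m_step(X, R, alpha=1.0):
    """Estimate naive Bayes parameters from word counts X (docs x vocab)
    and soft class memberships R (docs x classes).
    alpha is an assumed Laplace smoothing constant."""
    n_classes = R.shape[1]
    log_prior = np.log((R.sum(axis=0) + alpha) / (R.sum() + alpha * n_classes))
    word_counts = R.T @ X + alpha                      # classes x vocab
    log_word = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return log_prior, log_word

def e_step(X, log_prior, log_word):
    """Return class posteriors (docs x classes) under the current parameters."""
    joint = X @ log_word.T + log_prior                 # unnormalized log posteriors
    joint = joint - joint.max(axis=1, keepdims=True)   # for numerical stability
    post = np.exp(joint)
    return post / post.sum(axis=1, keepdims=True)

def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, n_iter=20, tol=1e-6):
    """Semi-supervised naive Bayes: labeled documents keep their hard labels,
    unlabeled documents receive soft labels that are re-estimated each round."""
    R_lab = np.eye(n_classes)[np.asarray(y_lab)]       # one-hot labels
    params = m_step(X_lab, R_lab)                      # initial classifier, labeled data only
    X_all = np.vstack([X_lab, X_unl])
    prev = None
    for _ in range(n_iter):
        R_unl = e_step(X_unl, *params)                 # probabilistically label unlabeled docs
        params = m_step(X_all, np.vstack([R_lab, R_unl]))  # retrain on all documents
        # Assumed stopping rule: posteriors stopped changing between rounds.
        if prev is not None and np.abs(R_unl - prev).max() < tol:
            break
        prev = R_unl
    return params

def predict(X, params):
    """Assign each document to its most probable class."""
    return e_step(X, *params).argmax(axis=1)
```

A call might look like `params = em_naive_bayes(X_lab, y_lab, X_unl, n_classes=2)` followed by `predict(X_new, params)`, where every `X` is a document-by-vocabulary matrix of word counts. Keeping the labeled documents' memberships fixed while only the unlabeled posteriors are updated is one straightforward reading of "trains a new classifier using the labels for all the documents".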
Added 19 Dec 2010
Updated 19 Dec 2010
Type Journal
Year 2000
Where ML
Authors Kamal Nigam, Andrew McCallum, Sebastian Thrun, Tom M. Mitchell