Text Classification from Labeled and Unlabeled Documents using EM


Download www.kamalnigam.com

This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents and uses it to probabilistically label the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However, these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve cl...
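The basic EM procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a multinomial naive Bayes model over word-count vectors with Laplace smoothing, and the function names (`train_nb`, `posterior`, `em_naive_bayes`), the `alpha`, `max_iter`, and `tol` parameters, and the stopping rule based on the change in posteriors are all hypothetical choices made for this example. The two extensions mentioned at the end of the abstract are not covered.

```python
import numpy as np

def train_nb(X, R, alpha=1.0):
    """M-step: fit naive Bayes from word counts X (docs x vocab) and
    class memberships R (docs x classes), which may be soft."""
    class_counts = R.sum(axis=0)
    # Smoothed class priors (alpha=1.0 is an assumed Laplace prior).
    log_prior = np.log((class_counts + alpha)
                       / (class_counts.sum() + alpha * R.shape[1]))
    # Expected word counts per class, smoothed, then normalized per class.
    word_counts = R.T @ X + alpha
    log_word = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return log_prior, log_word

def posterior(X, log_prior, log_word):
    """E-step: P(class | document) for each row of X."""
    log_joint = X @ log_word.T + log_prior
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_joint)
    return p / p.sum(axis=1, keepdims=True)

def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, max_iter=50, tol=1e-6):
    """Train on labeled data, then alternate E- and M-steps over all data."""
    R_lab = np.eye(n_classes)[y_lab]      # labeled documents keep hard labels
    params = train_nb(X_lab, R_lab)       # initial classifier: labeled data only
    X_all = np.vstack([X_lab, X_unl])
    R_prev = None
    for _ in range(max_iter):
        R_unl = posterior(X_unl, *params)                     # probabilistic labels
        params = train_nb(X_all, np.vstack([R_lab, R_unl]))   # retrain on all docs
        # Assumed stopping rule: posteriors on unlabeled docs stop changing.
        if R_prev is not None and np.abs(R_unl - R_prev).max() < tol:
            break
        R_prev = R_unl
    return params
```

With toy count matrices, `params = em_naive_bayes(X_lab, y_lab, X_unl, n_classes=2)` returns the log-priors and per-class log word probabilities, and `posterior(X_new, *params).argmax(axis=1)` gives predicted classes for new documents.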

Kamal Nigam, Andrew McCallum, Sebastian Thrun, Tom M. Mitchell


Available Labeled Documents | Machine Learning | ML 2000 | Unlabeled Data | Unlabeled Documents |


» Employing EM and Pool-Based Active Learning for Text Classification

» Learning to Classify Texts Using Positive and Unlabeled Data

» Learning to Classify Text from Labeled and Unlabeled Documents

» Text Classification by Labeling Words

» A model for handling approximate, noisy or incomplete labeling in text classification

» Self-taught learning: transfer learning from unlabeled data

» A parallel learning algorithm for text classification

» Combining Labeled and Unlabeled Data for MultiClass Text Categorization

Post Info

Added	19 Dec 2010
Updated	19 Dec 2010
Type	Journal
Year	2000
Where	ML
Authors	Kamal Nigam, Andrew McCallum, Sebastian Thrun, Tom M. Mitchell
