Mining Relevant Text from Unlabelled Documents

15 years 12 months ago

Download cs.gmu.edu

Automatic classification of documents is an important area of research with many applications in the fields of document searching, forensics and others. Methods to perform classification of text rely on the existence of a sample of documents whose class labels are known. However, in many situations, obtaining this sample may not be an easy (or even possible) task. In this paper we focus on the classification of unlabelled documents into two classes: relevant and irrelevant, given a topic of interest. By dividing the set of documents into buckets (for instance, answers returned by different search engines), and using association rule mining to find common sets of words among the buckets, we can efficiently obtain a sample of documents that has a large percentage of relevant ones. This sample can be used to train models to classify the entire set of documents. We prove, via experimentation, that our method is capable of filtering relevant documents even in adverse conditions wher...
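The bucket idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes each document is one transaction, restricts itemsets to word pairs, and uses whitespace tokenization. The function names (tokenize, frequent_pairs, common_pairs, select_sample) and the min_support parameter are hypothetical, introduced only for illustration.

```python
from collections import Counter
from itertools import combinations


def tokenize(doc):
    """Lower-case whitespace tokenization (an assumption; a real system
    would also remove stop words and punctuation)."""
    return set(doc.lower().split())


def frequent_pairs(bucket, min_support):
    """Word pairs that co-occur in at least a min_support fraction of the
    documents in one bucket. Each document is treated as one transaction."""
    counts = Counter()
    for doc in bucket:
        # Sorting gives every pair a canonical (a, b) order with a < b.
        counts.update(combinations(sorted(tokenize(doc)), 2))
    threshold = min_support * len(bucket)
    return {pair for pair, n in counts.items() if n >= threshold}


def common_pairs(buckets, min_support):
    """Word pairs that are frequent in every bucket."""
    pair_sets = [frequent_pairs(bucket, min_support) for bucket in buckets]
    return set.intersection(*pair_sets) if pair_sets else set()


def select_sample(buckets, min_support=0.5):
    """Documents that contain at least one pair frequent in all buckets.
    These form the candidate 'mostly relevant' sample."""
    pairs = common_pairs(buckets, min_support)
    sample = []
    for bucket in buckets:
        for doc in bucket:
            words = tokenize(doc)
            if any(a in words and b in words for a, b in pairs):
                sample.append(doc)
    return sample
```

In this sketch, a word pair must be frequent in every bucket to count as common. The intuition, under the stated assumptions, is that off-topic documents differ from bucket to bucket, so their vocabulary is unlikely to be frequent in all of them, while on-topic vocabulary recurs across buckets. The selected documents could then serve as the training sample the abstract mentions.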

Daniel Barbará, Carlotta Domeniconi, Ning K
