Classification of Skewed and Homogenous Document Corpora with Class-Based and Corpus-Based Keywords



In this paper, we examine the performance of two policies for keyword selection over standard document corpora of varying properties. In the corpus-based policy, a single set of keywords is selected globally for all classes, while in the class-based policy, a distinct set of keywords is selected locally for each class. We use SVM as the learning method and perform experiments with boolean and tf-idf weighting. Contrary to common belief, we show that using keywords instead of all words generally yields better performance and that tf-idf weighting does not always outperform boolean weighting. Our results reveal that the corpus-based approach performs better for a large number of keywords, while the class-based approach performs better for a small number of keywords. In skewed datasets, class-based keyword selection performs consistently better than the corpus-based approach in terms of macro-averaged F-measure. In homogenous datasets, performances of class-based and corpus-based approaches are similar except f...
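The difference between the two selection policies can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how keywords are scored, so plain document frequency is assumed here as the scoring function, and the function names (`corpus_based_keywords`, `class_based_keywords`, `boolean_vector`) and the toy corpus are hypothetical.

```python
from collections import Counter


def doc_freq(docs):
    """Count the number of documents in which each word appears."""
    df = Counter()
    for words in docs:
        df.update(set(words))
    return df


def top_k(df, k):
    """Return the k highest-scoring words (ties broken alphabetically)."""
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def corpus_based_keywords(docs, k):
    """Corpus-based policy: one global keyword set shared by all classes."""
    return top_k(doc_freq(docs), k)


def class_based_keywords(docs, labels, k):
    """Class-based policy: a separate keyword set scored within each class."""
    by_class = {}
    for words, label in zip(docs, labels):
        by_class.setdefault(label, []).append(words)
    return {label: top_k(doc_freq(group), k) for label, group in by_class.items()}


def boolean_vector(words, keywords):
    """Boolean weighting: 1 if the keyword occurs in the document, else 0."""
    present = set(words)
    return [1 if kw in present else 0 for kw in keywords]


# Toy skewed corpus (assumed data): three "grain" documents, one "gold" document.
docs = [
    "wheat price rises".split(),
    "wheat export price".split(),
    "wheat harvest price".split(),
    "gold mine opens".split(),
]
labels = ["grain", "grain", "grain", "gold"]

global_kw = corpus_based_keywords(docs, 2)
local_kw = class_based_keywords(docs, labels, 2)

print(global_kw)                                   # keywords shared by all classes
print(local_kw)                                    # keywords per class
print(boolean_vector(docs[3], global_kw))          # "gold" doc under global keywords
print(boolean_vector(docs[3], local_kw["gold"]))   # "gold" doc under its own keywords
```

In this toy corpus, with only two keywords allowed, the global set is filled by words from the majority class, so the single "gold" document has no active features under the corpus-based policy, while its class-based keyword set still describes it. This is only an intuition for why a small class-based keyword set might help rare classes in skewed data; it is not a reproduction of the paper's experiments.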

Arzucan Özgür, Tunga Güngör


Artificial Intelligence | Keyword Selection | Keywords | KI 2006 | Tf-idf Weighting


Added	14 Dec 2010
Updated	14 Dec 2010
Type	Journal
Year	2006
Where	KI
Authors	Arzucan Özgür, Tunga Güngör
