Sciweavers

KI
2006
Springer

Classification of Skewed and Homogenous Document Corpora with Class-Based and Corpus-Based Keywords

In this paper, we examine the performance of two policies for keyword selection over standard document corpora of varying properties. In the corpus-based policy, a single set of keywords is selected globally for all classes, whereas in the class-based policy a distinct set of keywords is selected locally for each class. We use SVM as the learning method and perform experiments with boolean and tf-idf weighting. Contrary to common belief, we show that using keywords instead of all words generally yields better performance and that tf-idf weighting does not always outperform boolean weighting. Our results reveal that the corpus-based approach performs better with a large number of keywords, while the class-based approach performs better with a small number of keywords. In skewed datasets, class-based keyword selection performs consistently better than the corpus-based approach in terms of macro-averaged F-measure. In homogenous datasets, the performances of the class-based and corpus-based approaches are similar except f...
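
The difference between the two selection policies can be illustrated with a minimal Python sketch. It is not the authors' implementation: the scoring function (plain document frequency), the function names, and the toy documents are all assumptions chosen for brevity, and the abstract does not state which scoring the paper uses. The sketch only shows that the corpus-based policy yields one global keyword set while the class-based policy yields one set per class, and how a boolean feature vector is built from a chosen set.

```python
from collections import Counter


def _top_k(df, k):
    # Rank by descending document frequency, breaking ties alphabetically.
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term for term, _ in ranked[:k]}


def corpus_based_keywords(docs, k):
    """One global keyword set shared by all classes (hypothetical scoring: DF)."""
    df = Counter()
    for tokens, _label in docs:
        df.update(set(tokens))  # count each term once per document
    return _top_k(df, k)


def class_based_keywords(docs, k):
    """A separate keyword set for each class (hypothetical scoring: in-class DF)."""
    per_class = {}
    for tokens, label in docs:
        per_class.setdefault(label, Counter()).update(set(tokens))
    return {label: _top_k(df, k) for label, df in per_class.items()}


def boolean_vector(tokens, keywords):
    """Boolean weighting: 1 if the keyword occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in sorted(keywords)]


# Toy, made-up corpus: (tokens, class label)
docs = [
    (["wheat", "price", "export"], "grain"),
    (["wheat", "harvest", "price"], "grain"),
    (["oil", "price", "barrel"], "crude"),
]

global_keywords = corpus_based_keywords(docs, 2)
local_keywords = class_based_keywords(docs, 2)
```

In this toy corpus the minority class `crude` contributes no term to the global set, whereas the class-based policy still gives it its own keywords. This is only a small-scale illustration of why the two policies can behave differently on skewed data, not a reproduction of the paper's experiments.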
Arzucan Özgür, Tunga Güngör
Added 14 Dec 2010
Updated 14 Dec 2010
Type Journal
Year 2006
Where KI
Authors Arzucan Özgür, Tunga Güngör