This paper explores topic aspect (i.e., subtopic or facet) classification for English and Chinese collections. The evaluation model assumes a bilingual user who has found documents on a topic and identified a few passages in each language on aspects of that topic. Additional passages are then automatically labeled using a k-Nearest-Neighbor classifier and local (i.e., result set) Latent Semantic Analysis. Experiments show that when few training examples are available in either language, classification using training examples from both languages can often achieve higher effectiveness than using training examples from just one language. When the total number of training examples is held constant, classification effectiveness correlates positively with the fraction of same-language training examples in the training set. These results suggest that supervised classification can benefit from hand-annotating a few same-language examples, and that when performing classification in bilingual c...
Yejun Wu, Douglas W. Oard