Sciweavers

SIGIR
2004
ACM

Parameterized generation of labeled datasets for text categorization based on a hierarchical directory

Although text categorization is a burgeoning area of IR research, readily available test collections in this field are surprisingly scarce. We describe a methodology and system (named Accio) for automatically acquiring labeled datasets for text categorization from the World Wide Web, by capitalizing on the body of knowledge encoded in the structure of existing hierarchical directories such as the Open Directory. We define parameters of categories that make it possible to acquire numerous datasets with desired properties, which in turn allow better control over categorization experiments. In particular, we develop metrics that estimate the difficulty of a dataset by examining the host directory structure. These metrics are shown to be good predictors of the categorization accuracy that can be achieved on a dataset, and serve as efficient heuristics for generating datasets subject to the user’s requirements. A large collection of automatically generated datasets is made available for other...
Added 30 Jun 2010
Updated 30 Jun 2010
Type Conference
Year 2004
Where SIGIR
Authors Dmitry Davidov, Evgeniy Gabrilovich, Shaul Markovitch