Parameterized generation of labeled datasets for text categorization based on a hierarchical directory

Although text categorization is a burgeoning area of IR research, readily available test collections in this field are surprisingly scarce. We describe a methodology and system (named Accio) for automatically acquiring labeled datasets for text categorization from the World Wide Web, by capitalizing on the body of knowledge encoded in the structure of existing hierarchical directories such as the Open Directory. We define parameters of categories that make it possible to acquire numerous datasets with desired properties, which in turn allow better control over categorization experiments. In particular, we develop metrics that estimate the difficulty of a dataset by examining the host directory structure. These metrics are shown to be good predictors of categorization accuracy that can be achieved on a dataset, and serve as efficient heuristics for generating datasets subject to user’s requirements. A large collection of automatically generated datasets are made available for other...

Dmitry Davidov, Evgeniy Gabrilovich, Shaul Markovitch

Available Test Collections | Categorization | SIGIR 2004 | Text Categorization |

Added	30 Jun 2010
Updated	30 Jun 2010
Type	Conference
Year	2004
Where	SIGIR
Authors	Dmitry Davidov, Evgeniy Gabrilovich, Shaul Markovitch
