Long-tail Vocabulary Dictionary Extraction from the Web

10 years 3 months ago

A dictionary, a set of instances belonging to the same conceptual class, is central to information extraction and is a useful primitive for many applications, including query log analysis and document categorization. Considerable work has focused on generating accurate dictionaries given a few example seeds, but methods to date cannot obtain long-tail (rare) items with high accuracy and recall. In this paper, we develop a novel method to construct high-quality dictionaries, especially for long-tail vocabularies, using just a few user-provided seeds for each topic. Our algorithm obtains long-tail (i.e., rare) items by building and executing high-quality webpage-specific extractors. We use webpage-specific structural and textual information to build more accurate per-page extractors in order to detect the long-tail items from a single webpage. These webpage-specific extractors are obtained via a co-training procedure using distantly-supervised training data. By aggregating the ...

Zhe Chen, Michael J. Cafarella, H. V. Jagadish
