Sciweavers

WSDM
2016
ACM

Long-tail Vocabulary Dictionary Extraction from the Web

A dictionary — a set of instances belonging to the same conceptual class — is central to information extraction and is a useful primitive for many applications, including query log analysis and document categorization. Considerable work has focused on generating accurate dictionaries given a few example seeds, but methods to date cannot obtain long-tail (rare) items with high accuracy and recall. In this paper, we develop a novel method to construct high-quality dictionaries, especially for long-tail vocabularies, using just a few user-provided seeds for each topic. Our algorithm obtains long-tail (i.e., rare) items by building and executing high-quality webpage-specific extractors. We use webpage-specific structural and textual information to build more accurate per-page extractors in order to detect the long-tail items from a single webpage. These webpage-specific extractors are obtained via a co-training procedure using distantly-supervised training data. By aggregating the ...
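As a rough illustration of the seed-driven, per-page idea the abstract describes, the sketch below uses seeds as distant supervision to pick one structural pattern on a single page, then extracts every item sharing that pattern, including rare items the seeds never mention. This is not the authors' algorithm: it omits co-training and textual features entirely, and the names `extract_from_page`, `build_dictionary`, `min_seed_hits`, and the `(text, path)` page representation are assumptions made for this example only.

```python
from collections import Counter


def extract_from_page(items, seeds, min_seed_hits=2):
    """Toy per-page extractor (hypothetical, not the paper's method).

    items: list of (text, path) pairs for one page, where path is any
           string describing the item's structural position (assumed).
    seeds: iterable of known dictionary entries.
    """
    seed_set = {s.strip().lower() for s in seeds}
    # Distant supervision: an item whose text matches a seed counts as a
    # positive example for the structural path it appears under.
    hits = Counter(
        path for text, path in items if text.strip().lower() in seed_set
    )
    if not hits:
        return set()
    best_path, count = hits.most_common(1)[0]
    # Require several seed matches so one stray mention does not
    # promote an unrelated region of the page.
    if count < min_seed_hits:
        return set()
    # Everything under the winning path is extracted, seeds or not.
    return {text.strip() for text, path in items if path == best_path}


def build_dictionary(pages, seeds, min_seed_hits=2):
    """Union the per-page extractions over a collection of pages."""
    dictionary = set()
    for items in pages:
        dictionary |= extract_from_page(items, seeds, min_seed_hits)
    return dictionary
```

For example, given a page whose list items are `Paris`, `Tokyo`, and `Vaduz` under the same path and the seeds `paris` and `tokyo`, the sketch also returns `Vaduz`, which stands in for a long-tail item. A page with too few seed matches contributes nothing.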
Zhe Chen, Michael J. Cafarella, H. V. Jagadish
Added 12 Apr 2016
Updated 12 Apr 2016
Type Conference
Year 2016
Where WSDM