In this paper, we identify and analyze structural properties which reflect the functionality of a Web site. These structural properties consider the size, the organization, the co...
Geographical gazetteers are necessary in a wide variety of applications. In the past, the construction of such gazetteers has been a tedious, manual process and only recently have...
Adrian Popescu, Gregory Grefenstette, Houda Bouamo...
Abstract. In this paper, we describe a new approach to information extraction that neatly integrates top-down hypothesis driven information with bottom-up data driven information. ...
This paper presents BlogBuster, a tool for extracting a corpus from the blogosphere. The topic of cleaning arbitrary web pages with the goal of extracting a corpus from web data, ...
The AutoFeed system automatically extracts data from semistructured web sites. Previously, researchers have developed two types of supervised learning approaches for extracting we...