Sciweavers

139 search results - page 20 / 28
» An Approach to Identify Duplicated Web Pages
WIDM
2006
ACM
14 years 1 month ago
Coarse-grained classification of web sites by their structural properties
In this paper, we identify and analyze structural properties which reflect the functionality of a Web site. These structural properties consider the size, the organization, the co...
Christoph Lindemann, Lars Littig
WEBI
2009
Springer
14 years 2 months ago
Mining a Multilingual Geographical Gazetteer from the Web
Geographical gazetteers are necessary in a wide variety of applications. In the past, the construction of such gazetteers has been a tedious, manual process and only recently have...
Adrian Popescu, Gregory Grefenstette, Houda Bouamo...
AUSAI
2003
Springer
14 years 28 days ago
Information Extraction via Path Merging
In this paper, we describe a new approach to information extraction that neatly integrates top-down hypothesis-driven information with bottom-up data-driven information. ...
Robert Dale, Cécile Paris, Marc Tilbrook
LREC
2010
13 years 9 months ago
BlogBuster: A Tool for Extracting Corpora from the Blogosphere
This paper presents BlogBuster, a tool for extracting a corpus from the blogosphere. The topic of cleaning arbitrary web pages with the goal of extracting a corpus from web data, ...
Georgios Petasis, Dimitrios Petasis
KCAP
2005
ACM
14 years 1 month ago
AutoFeed: an unsupervised learning system for generating webfeeds
The AutoFeed system automatically extracts data from semistructured web sites. Previously, researchers have developed two types of supervised learning approaches for extracting we...
Bora Gazen, Steven Minton