Sciweavers

468 search results - page 32 / 94
» Automatic Data Extraction from Data-Rich Web Pages
Sort
View
WEBDB
1999
Springer
196views Database» more  WEBDB 1999»
13 years 12 months ago
Web Ecology: Recycling HTML Pages as XML Documents Using W4F
In this paper we present the World-Wide Web Wrapper Factory (W4F), a Java toolkit to generate wrappers for Web data sources. Some key features of W4F are an expressive language to...
Arnaud Sahuguet, Fabien Azavant
ECIR
2006
Springer
13 years 9 months ago
Automatic Acquisition of Chinese-English Parallel Corpus from the Web
Parallel corpora are a valuable resource for tasks such as cross-language information retrieval and data-driven natural language processing systems. Previously only small scale cor...
Ying Zhang, Ke Wu, Jianfeng Gao, Phil Vines
AIRWEB
2007
Springer
14 years 1 months ago
Extracting Link Spam using Biased Random Walks from Spam Seed Sets
Link spam deliberately manipulates hyperlinks between web pages in order to unduly boost the search engine ranking of one or more target pages. Link based ranking algorithms such ...
Baoning Wu, Kumar Chellapilla
JCDL
2004
ACM
198views Education» more  JCDL 2004»
14 years 1 months ago
Finding authoritative people from the web
Today’s web is so huge and diverse that it arguably reflects the real world. For this reason, searching the web is a promising approach to find things in the real world. This ...
Masanori Harada, Shin-ya Sato, Kazuhiro Kazama
LREC
2010
216views Education» more  LREC 2010»
13 years 9 months ago
BlogBuster: A Tool for Extracting Corpora from the Blogosphere
This paper presents BlogBuster, a tool for extracting a corpus from the blogosphere. The topic of cleaning arbitrary web pages with the goal of extracting a corpus from web data, ...
Georgios Petasis, Dimitrios Petasis