Search Sciweavers | Sciweavers


» Automatic Data Extraction from Data-Rich Web Pages


WEBDB
1999
Springer


Web Ecology: Recycling HTML Pages as XML Documents Using W4F

15 years 11 months ago

Download db.cis.upenn.edu

In this paper we present the World-Wide Web Wrapper Factory (W4F), a Java toolkit to generate wrappers for Web data sources. Some key features of W4F are an expressive language to...

Arnaud Sahuguet, Fabien Azavant


ECIR
2006
Springer


Automatic Acquisition of Chinese-English Parallel Corpus from the Web

15 years 8 months ago

Download research.microsoft.com

Parallel corpora are a valuable resource for tasks such as cross-language information retrieval and data-driven natural language processing systems. Previously only small scale cor...

Ying Zhang, Ke Wu, Jianfeng Gao, Phil Vines


AIRWEB
2007
Springer


Extracting Link Spam using Biased Random Walks from Spam Seed Sets

16 years 1 month ago

Download airweb.cse.lehigh.edu

Link spam deliberately manipulates hyperlinks between web pages in order to unduly boost the search engine ranking of one or more target pages. Link based ranking algorithms such ...

Baoning Wu, Kumar Chellapilla


JCDL
2004
ACM


Finding authoritative people from the web

16 years 24 days ago

Download www.ingrid.org

Today’s web is so huge and diverse that it arguably reflects the real world. For this reason, searching the web is a promising approach to find things in the real world. This ...

Masanori Harada, Shin-ya Sato, Kazuhiro Kazama


LREC
2010


BlogBuster: A Tool for Extracting Corpora from the Blogosphere

15 years 8 months ago

Download www.lrec-conf.org

This paper presents BlogBuster, a tool for extracting a corpus from the blogosphere. The topic of cleaning arbitrary web pages with the goal of extracting a corpus from web data, ...

Georgios Petasis, Dimitrios Petasis
