Sciweavers

498 search results - page 3 / 100
» Robust web content extraction
SIGIR
2008
ACM
13 years 6 months ago
SpotSigs: robust and efficient near duplicate detection in large web collections
Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching sig...
Martin Theobald, Jonathan Siddharth, Andreas Paepc...
EJC
2009
13 years 4 months ago
A New Partial Information Extraction Method for Personal Mashup Construction
Nowadays more and more Web sites generate Web pages containing client-side scripts such as JavaScript and Flash instead of ordinary static HTML pages. These scripts create dynamic ...
Junxia Guo, Hao Han, Takehiro Tokuda
ICWE
2009
Springer
14 years 1 month ago
A Layout-Independent Web News Article Contents Extraction Method Based on Relevance Analysis
Traditional Web news article content extraction methods are time-consuming and need much maintenance because they analyze the layout of news pages to generate the wrapp...
Hao Han, Takehiro Tokuda
SIGMOD
2010
ACM
13 years 6 months ago
Optimizing content freshness of relations extracted from the web using keyword search
An increasing number of applications operate on data obtained from the Web. These applications typically maintain local copies of the web data to avoid network latency in data acc...
Mohan Yang, Haixun Wang, Lipyeow Lim, Min Wang
APWEB
2010
Springer
13 years 4 months ago
ECON: An Approach to Extract Content from Web News Page
This paper provides a simple but effective approach, named ECON, to fully automatically extract content from Web news pages. ECON uses a DOM tree to represent the Web news...
Yan Guo, Huifeng Tang, Linhai Song, Yu Wang 0009, ...