First generation Web-content encodes information in handwritten (HTML) Web pages. Second generation Web content generates HTML pages on demand, e.g. by filling in templates with c...
Jacco van Ossenbruggen, Joost Geurts, Frank Cornel...
EuroGOV is a multilingual web corpus that was created to serve as the document collection for WebCLEF, the CLEF 2005 web retrieval task. EuroGOV is a collection of web pages crawl...
We describe an HTML web page segmentation algorithm, which is applied to segment online medical journal articles (regular HTML and PDF-Converted-HTML files). The web page content ...
This paper presents a general framework for building classifiers that deal with short and sparse text & Web segments by making the most of hidden topics discovered from larges...
The success of the Semantic Web crucially depends on the existence of Web pages that provide machine-understandable meta-data. This meta-data is typically added in the semantic an...