

EuroGOV: Engineering a Multilingual Web Corpus

EuroGOV is a multilingual web corpus that was created to serve as the document collection for WebCLEF, the CLEF 2005 web retrieval task. EuroGOV is a collection of web pages crawled from the European Union portal, European Union member state governmental web sites, and Russian government web sites. The corpus contains over 3 million documents written in more than 20 different European languages. In this paper we provide a detailed description of the EuroGOV collection.

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries

General Terms: Measurement, Performance, Experimentation

Keywords: Multilingual web corpus, Web retrieval, Multilingual retrieval
Added 26 Jun 2010
Updated 26 Jun 2010
Type Conference
Year 2005
Where CLEF
Authors Börkur Sigurbjörnsson, Jaap Kamps, Maarten de Rijke