Language resources extracted from Wikipedia

Wikipedia provides a considerable amount of text for more than a hundred languages, including languages for which no reference corpora or other linguistic resources are easily available. We have extracted background language models built from the content of Wikipedia in various languages. The models generated from the Simple English Wikipedia and the English Wikipedia are compared to language models derived from other established corpora. The differences between the models with regard to term coverage, term distribution, and correlation are described and discussed. We provide access to the full dataset and create visualizations of the language models that can be used for exploratory analysis. The paper describes the newly released dataset for 33 languages and the services that we provide on top of it.
Categories and Subject Descriptors: I.2.7 [Natural Language Processing]: Language models; I.2.6 [Learning]: Knowledge acquisition
General Terms: Languages, Measurement
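
The comparisons the abstract mentions, term coverage and rank correlation between a Wikipedia-derived model and a reference corpus, can be illustrated with a small sketch. This is not the authors' pipeline: the file names, the whitespace tokenizer, and the use of Spearman's formula without tie handling are illustrative assumptions.

```python
from collections import Counter

def unigram_model(path):
    """Count term frequencies in a whitespace-tokenized plain-text corpus."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.lower().split())
    return counts

def term_coverage(background, reference):
    """Fraction of the reference vocabulary covered by the background model."""
    if not reference:
        return 0.0
    covered = sum(1 for term in reference if term in background)
    return covered / len(reference)

def spearman_rank_correlation(model_a, model_b):
    """Spearman correlation of term-frequency ranks on the shared vocabulary.
    Ties in frequency are not averaged, so this is only an approximation."""
    shared = [t for t in model_a if t in model_b]
    n = len(shared)
    if n < 2:
        return 0.0
    # Rank the shared terms within each model by descending frequency.
    rank_a = {t: r for r, t in enumerate(sorted(shared, key=lambda w: -model_a[w]))}
    rank_b = {t: r for r, t in enumerate(sorted(shared, key=lambda w: -model_b[w]))}
    d_squared = sum((rank_a[t] - rank_b[t]) ** 2 for t in shared)
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))

if __name__ == "__main__":
    # Hypothetical file names; any two plain-text corpora will do.
    wiki = unigram_model("simple_wikipedia.txt")
    reference = unigram_model("reference_corpus.txt")
    print("term coverage:", term_coverage(wiki, reference))
    print("rank correlation:", spearman_rank_correlation(wiki, reference))
```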
Added 16 Sep 2011
Updated 16 Sep 2011
Type Conference
Year 2011
Where KCAP
Authors Denny Vrandecic, Philipp Sorg, Rudi Studer