
KCAP 2011, ACM

Language resources extracted from Wikipedia

Wikipedia provides a substantial amount of text in more than a hundred languages, including languages for which no reference corpora or other linguistic resources are easily available. We have extracted background language models from the content of Wikipedia in various languages. The models generated from the Simple English and the English Wikipedia are compared to language models derived from other established corpora. The differences between the models with regard to term coverage, term distribution, and correlation are described and discussed. We provide access to the full dataset and create visualizations of the language models that can be used for exploratory analysis. The paper describes the newly released dataset for 33 languages and the services that we provide on top of it.
Categories and Subject Descriptors: I.2.7 [Natural Language Processing]: Language models; I.2.6 [Learning]: Knowledge acquisition
General Terms: Languages, Measurement
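The abstract describes building background language models from Wikipedia text and comparing them to reference corpora by term coverage and distribution. The sketch below is not the authors' code; it is a minimal illustration, assuming a simple unigram model over word tokens and a coverage measure defined as the probability mass of one model's terms found in the other. File names and tokenization are placeholder assumptions.

```python
# Illustrative sketch only: unigram "background language model" from plain text,
# plus a simple term-coverage comparison between two such models.
from collections import Counter
import re


def unigram_model(text: str) -> dict[str, float]:
    """Relative term frequencies over a naive word tokenization (assumption)."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: freq / total for term, freq in counts.items()}


def term_coverage(model_a: dict[str, float], model_b: dict[str, float]) -> float:
    """Fraction of model_b's probability mass whose terms also occur in model_a."""
    return sum(p for term, p in model_b.items() if term in model_a)


# Placeholder file names; the released dataset and corpora are not reproduced here.
wiki_model = unigram_model(open("simple_wikipedia_sample.txt", encoding="utf-8").read())
ref_model = unigram_model(open("reference_corpus_sample.txt", encoding="utf-8").read())
print(f"coverage of reference corpus by Wikipedia model: "
      f"{term_coverage(wiki_model, ref_model):.3f}")
```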
Denny Vrandecic, Philipp Sorg, Rudi Studer
Added 16 Sep 2011
Updated 16 Sep 2011
Type Conference
Year 2011
Where KCAP
Authors Denny Vrandecic, Philipp Sorg, Rudi Studer