A Corpus Factory for Many Languages



For many languages, no large, general-language corpora are available. Before the Web, all but the richest institutions could do little but shake their heads in dismay, because corpus-building was long, slow and expensive. With the advent of the Web, corpus-building can be highly automated, and thereby fast and inexpensive. We have developed a 'corpus factory' where we build large corpora. In this paper we describe the method we use, how it has worked, and how various problems were solved, for eight languages: Dutch, Hindi, Indonesian, Norwegian, Swedish, Telugu, Thai and Vietnamese. The corpora we have developed are available for use in the Sketch Engine corpus query tool.

Adam Kilgarriff, Siva Reddy, Jan Pomikálek, P. V. S. Avinesh


Education | General-language Corpora | Large Corpora | LREC 2010 | Richest Institutions |



Post Info

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2010
Where	LREC
Authors	Adam Kilgarriff, Siva Reddy, Jan Pomikálek, P. V. S. Avinesh
