
LREC
2010

A Corpus Factory for Many Languages

For many languages there are no large, general-language corpora available. Before the web, all but the richest institutions could do little but shake their heads in dismay, as corpus-building was long, slow and expensive. With the advent of the web, it can be highly automated and is thereby fast and inexpensive. We have developed a 'corpus factory' where we build large corpora. In this paper we describe the method we use, how it has worked, and how various problems were solved, for eight languages: Dutch, Hindi, Indonesian, Norwegian, Swedish, Telugu, Thai and Vietnamese. The corpora we have developed are available for use in the Sketch Engine corpus query tool.
Adam Kilgarriff, Siva Reddy, Jan Pomikálek, P. V. S. Avinesh
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2010
Where LREC
Authors Adam Kilgarriff, Siva Reddy, Jan Pomikálek, P. V. S. Avinesh