From D-Coi to SoNaR: a reference corpus for Dutch

The computational linguistics community in The Netherlands and Belgium has long recognized the dire need for a major reference corpus of written Dutch. The STEVIN programme was established in part to answer this need. A pilot project was set up to pave the way for the effective building of a 500-million-word reference corpus of written Dutch. This pilot, the Dutch Corpus Initiative project or D-Coi, was highly successful: it not only realized about 10% of the projected reference corpus, but also established best practices and developed the protocols and tools needed to build the larger corpus within the confines of a necessarily limited budget. We outline the steps involved in an endeavour of this kind, including the major highlights and possible pitfalls. Once converted to a suitable XML format, further linguistic annotation based on the state-of-the-art tools developed either before or during the pilot by the consortium partners proved easily and fruitfully...

Nelleke Oostdijk, Martin Reynaert, Paola Monachesi

500-million-word Reference Corpus | Education | LREC 2008 | Major Reference Corpus | Reference Corpus |

» Balancing SoNaR IPR versus Processing Issues in a 500MillionWord Written Dutch Reference C...

» Towards a Balanced Named Entity Corpus for Dutch

» The DTUNA Corpus A Dutch Dataset for the Evaluation of Referring Expression Generation Alg...

» Automatic phonetic transcription of large speech corpora

» Evaluation of a Machine Translation System for Low Resource Languages METISII

» Structural Equations in Language Learning

Post Info

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2008
Where	LREC
Authors	Nelleke Oostdijk, Martin Reynaert, Paola Monachesi, Gertjan van Noord, Roeland Ordelman, Ineke Schuurman, Vincent Vandeghinste
