Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus

Download www.lrec-conf.org

In the Low Countries, a major reference corpus for written Dutch is currently being built. In this paper, we discuss the interplay between data acquisition and data processing during the creation of the SoNaR Corpus. Based on recent developments in traditional corpus compilation and new web harvesting approaches, SoNaR is designed to contain 500 million words, balanced over 36 text types including both traditional and new media texts. In addition to this balanced design, every text sample included in SoNaR will have its IPR issues settled to the largest extent possible. This data collection task presents many challenges, because every decision taken at the level of text acquisition has ramifications for the level of processing and for the general usability of the corpus later on. As far as the traditional text types are concerned, each text brings its own processing requirements and issues. For new media texts (SMS, chat), the problem is even more complex: issues such as anonymity, recognizability a...

Martin Reynaert, Nelleke Oostdijk, Orphée D

Education | LREC 2010 | Sonar | SoNaR Corpus | Text Types |

Post Info

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2010
Where	LREC
Authors	Martin Reynaert, Nelleke Oostdijk, Orphée De Clercq, Henk van den Heuvel, Franciska de Jong
