Belgisch Staatsblad Corpus: Retrieving French-Dutch Sentences from Official Documents

We describe the compilation of a large corpus of French-Dutch sentence pairs from official Belgian documents that are available in the online version of the publication Belgisch Staatsblad/Moniteur belge and that were published between 1997 and 2006. After downloading the files in batch, we filtered out documents that have no translation in the other language, documents that contain several languages (detected by checking for discriminating words), and pairs of documents with a substantial difference in length. We segmented the documents into sentences and aligned them, which resulted in 5 million sentence pairs (only one-to-one links were included in the parallel corpus), of which 2.4 million are unique. Sample-based evaluation of the sentence alignment indicates an accuracy of nearly 100%, which can be explained by the text genre, the filtering procedure that removes weakly parallel articles, and the restriction to one-to-one links. The corpus is larger than a number of well-known Fren...
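The document-level filtering described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the marker word lists, the function names (`marker_counts`, `is_monolingual`, `similar_length`, `keep_pair`) and the thresholds (`max_foreign_share=0.2`, `max_ratio=1.5`) are all assumptions chosen for illustration, since the abstract does not state the actual discriminating words or the length criterion.

```python
import re

# Hypothetical discriminating words: each list holds frequent function words
# that occur in one language but not the other. The abstract does not give
# the real lists used for the corpus.
FRENCH_MARKERS = {"le", "les", "des", "du", "est", "une", "dans", "pour"}
DUTCH_MARKERS = {"het", "een", "van", "voor", "niet", "zijn", "wordt", "op"}


def marker_counts(text):
    """Count French and Dutch marker words in a text."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    french = sum(word in FRENCH_MARKERS for word in words)
    dutch = sum(word in DUTCH_MARKERS for word in words)
    return french, dutch


def is_monolingual(text, expected, max_foreign_share=0.2):
    """Return True if markers of the other language stay under the threshold.

    `expected` is "fr" or "nl"; the threshold value is an assumption.
    """
    french, dutch = marker_counts(text)
    total = french + dutch
    if total == 0:
        return False  # no evidence of either language
    foreign = dutch if expected == "fr" else french
    return foreign / total <= max_foreign_share


def similar_length(fr_text, nl_text, max_ratio=1.5):
    """Return True if the character lengths differ by at most `max_ratio`."""
    shorter, longer = sorted((len(fr_text), len(nl_text)))
    if shorter == 0:
        return False
    return longer / shorter <= max_ratio


def keep_pair(fr_text, nl_text):
    """Apply the three document-level filters to one candidate pair."""
    if not fr_text or not nl_text:
        return False  # no translation in the other language
    return (
        is_monolingual(fr_text, "fr")
        and is_monolingual(nl_text, "nl")
        and similar_length(fr_text, nl_text)
    )


fr_doc = "Le ministre est chargé de l'exécution du présent arrêté dans les délais."
nl_doc = "De minister is belast met de uitvoering van dit besluit; het wordt niet gewijzigd."
print(keep_pair(fr_doc, nl_doc))
```

Only document pairs that pass `keep_pair` would go on to sentence segmentation and alignment. A character-length ratio is used here for simplicity; a word-count ratio would be an equally plausible reading of "substantial difference in length".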

Tom Vanallemeersch

Documents | Education | French-Dutch Sentence Pairs | LREC 2010 | Sentence Pairs |

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2010
Where	LREC
Authors	Tom Vanallemeersch