Automatic Acquisition of Parallel Corpora from Websites with Dynamic Content

Parallel corpora are indispensable resources for a variety of multilingual natural language processing tasks. This paper presents a technique for fully automatic construction of constantly growing parallel corpora. We propose a simple and effective dictionary-based algorithm to extract parallel document pairs from a large collection of articles retrieved from the Internet, potentially containing manually translated texts. This algorithm was implemented and tested on Hebrew-English parallel texts. With properly selected thresholds, precision of 100% can be obtained.
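
The abstract does not spell out the algorithm, so the following is only a minimal Python sketch of how a dictionary-based pairing step with a threshold could look. It is not the authors' implementation: the function names (`overlap_score`, `extract_pairs`), the scoring rule, and the single `threshold` parameter are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def overlap_score(src_tokens: List[str],
                  tgt_tokens: List[str],
                  dictionary: Dict[str, Set[str]]) -> float:
    """Fraction of dictionary-covered source tokens that have at least one
    translation present in the target document (hypothetical scoring rule)."""
    tgt_set = set(tgt_tokens)
    covered = [t for t in src_tokens if t in dictionary]
    if not covered:
        return 0.0
    hits = sum(1 for t in covered if dictionary[t] & tgt_set)
    return hits / len(covered)


def extract_pairs(src_docs: Dict[str, List[str]],
                  tgt_docs: Dict[str, List[str]],
                  dictionary: Dict[str, Set[str]],
                  threshold: float = 0.5) -> List[Tuple[str, str, float]]:
    """For each source document, keep its best-scoring target document,
    but only if the score reaches the (assumed) threshold."""
    pairs = []
    for src_id, src_tokens in src_docs.items():
        best_id, best_score = None, 0.0
        for tgt_id, tgt_tokens in tgt_docs.items():
            score = overlap_score(src_tokens, tgt_tokens, dictionary)
            if score > best_score:
                best_id, best_score = tgt_id, score
        if best_id is not None and best_score >= threshold:
            pairs.append((src_id, best_id, best_score))
    return pairs
```

In a sketch like this, raising `threshold` discards weaker candidate pairs, trading recall for precision, which is in the spirit of the abstract's remark about properly selected thresholds.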

Yulia Tsvetkov, Shuly Wintner

Education | Hebrew-English Parallel Texts | LREC 2010 | Parallel Corpora | Parallel Document Pairs

Post Info

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2010
Where	LREC
Authors	Yulia Tsvetkov, Shuly Wintner
