Large Scale Parallel Document Mining for Machine Translation



A distributed system is described that reliably mines parallel text from large corpora. The approach can be regarded as cross-language near-duplicate detection, enabled by an initial, low-quality batch translation. In contrast to other approaches which require specialized metadata, the system uses only the textual content of the documents. Results are presented for a corpus of over two billion web pages and for a large collection of digitized public-domain books.
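The idea of cross-language near-duplicate detection can be illustrated with a minimal sketch. It assumes every document has already been roughly translated into one common language (the "low-quality batch translation" step), and it compares documents by the overlap of their word n-grams. The function names, the Jaccard measure, the n-gram length, and the threshold are all assumptions chosen for illustration; this is not the authors' pipeline, and the all-pairs loop shown here would not scale to billions of pages.

```python
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Short texts yield a single shingle holding all of their tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def mine_pairs(translated_docs, languages, threshold=0.5, n=3):
    """Return candidate parallel pairs as (id_a, id_b, score) tuples.

    translated_docs: dict mapping doc id -> text in the common language
                     (hypothetical output of the batch translation step).
    languages:       dict mapping doc id -> original language code.
    Only documents whose original languages differ are compared.
    """
    sets = {doc_id: shingles(text, n) for doc_id, text in translated_docs.items()}
    pairs = []
    # Quadratic comparison: acceptable for a toy corpus only.
    for a, b in combinations(sorted(sets), 2):
        if languages[a] == languages[b]:
            continue
        score = jaccard(sets[a], sets[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


if __name__ == "__main__":
    docs = {
        "en1": "the committee approved the annual budget on monday",
        "de1": "the committee approved the annual budget on monday evening",
        "fr1": "weather forecasts predict heavy rain across the region",
    }
    langs = {"en1": "en", "de1": "de", "fr1": "fr"}
    for pair in mine_pairs(docs, langs):
        print(pair)
```

Because the comparison uses only the translated text, the sketch needs no metadata such as URLs or publication dates, which matches the property the abstract emphasizes.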

Jakob Uszkoreit, Jay Ponte, Ashok C. Popat, Moshe Dubiner


COLING 2010 | Computational Linguistics | Large Corpora | Low-quality Batch Translation | Parallel Text


» A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Transl...

» Creating Sentence-Aligned Parallel Text Corpora from a Large Archive of Potential Parallel ...

» Parallel Strands: A Preliminary Investigation into Mining the Web for Bilingual Text

» Enabling scalability and performance in a large scale CMP environment

» An Empirical Study on Web Mining of Parallel Data

» PaDDMAS: Parallel and Distributed Data Mining Application Suite

» Scaling up text classification for large file systems

» Enhanced Infrastructure for Creation and Collection of Translation Resources

Post Info

Added	13 May 2011
Updated	13 May 2011
Type	Journal
Year	2010
Where	COLING
Authors	Jakob Uszkoreit, Jay Ponte, Ashok C. Popat, Moshe Dubiner


Sciweavers
