Using TectoMT as a Preprocessing Tool for Phrase-Based Statistical Machine Translation


Download ufal.mff.cuni.cz

Abstract. We present a systematic comparison of preprocessing techniques for two language pairs: English-Czech and English-Hindi. The two target languages, although both belonging to the Indo-European language family, show significant differences in morphology, syntax and word order. We describe how TectoMT, a successful framework for analysis and generation of language, can be used as a preprocessor for a phrase-based MT system. We compare the two language pairs and the optimal sets of source-language transformations applied to them. The following transformations are examples of possible preprocessing steps: lemmatization; retokenization, compound splitting; removing/adding words lacking counterparts in the other language; phrase reordering to resemble the target word order; marking syntactic functions. TectoMT, as well as all other tools and data sets we use, is freely available on the Web.

Key words: phrase-based translation, preprocessing, reordering
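To make the listed preprocessing steps concrete, here is a minimal Python sketch of a source-side pipeline applied to one English clause before it would be passed to a phrase-based system. It is a toy illustration only and does not use TectoMT or its API: the lemma table, the role labels (S, V, O, X) and all function names are hypothetical, and a real system would obtain lemmas and syntactic functions from an analyzer. The reordering step moves the verb to the end of the clause, which roughly imitates the verb-final order of Hindi.

```python
# Toy illustration of source-side preprocessing; NOT TectoMT's API.
# All names and data below are hypothetical assumptions.

# Hypothetical lemma table; a real pipeline would use a morphological analyzer.
LEMMAS = {"reads": "read", "books": "book", "bought": "buy"}

# English articles have no direct counterpart in Czech or Hindi.
ARTICLES = {"a", "an", "the"}


def drop_articles(tagged):
    """Remove source words that lack a counterpart in the target language."""
    return [(tok, role) for tok, role in tagged if tok.lower() not in ARTICLES]


def lemmatize(tagged):
    """Replace each token by its lemma when the toy table knows one."""
    return [(LEMMAS.get(tok.lower(), tok), role) for tok, role in tagged]


def reorder_verb_final(tagged):
    """Move verbs (role 'V') to the end of the clause, keeping relative order."""
    verbs = [pair for pair in tagged if pair[1] == "V"]
    others = [pair for pair in tagged if pair[1] != "V"]
    return others + verbs


def mark_functions(tagged):
    """Attach the syntactic function to each token, e.g. 'book|O'."""
    return [f"{tok}|{role}" for tok, role in tagged]


def preprocess(tagged):
    """Apply the toy transformations in a fixed order to one flat clause."""
    tagged = drop_articles(tagged)
    tagged = lemmatize(tagged)
    tagged = reorder_verb_final(tagged)
    return mark_functions(tagged)


if __name__ == "__main__":
    clause = [("The", "X"), ("girl", "S"), ("reads", "V"),
              ("the", "X"), ("books", "O")]
    print(" ".join(preprocess(clause)))
```

Which of these steps actually helps is an empirical question for each language pair, so in practice each transformation would be switched on or off independently and evaluated.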

Daniel Zeman


Language Pairs | Preprocessing | Signal Processing | TSD 2010 | Word Order |


» A Phrase-Based Statistical Model for SMS Text Normalization

» Moses Open Source Toolkit for Statistical Machine Translation

Post Info

Added	31 Jan 2011
Updated	31 Jan 2011
Type	Journal
Year	2010
Where	TSD
Authors	Daniel Zeman
