In statistical language modeling, one technique to reduce the problematic effects of data sparsity is to partition the vocabulary into equivalence classes. In this paper we investigate the effects of applying such a technique to higher-order n-gram models trained on large corpora. We introduce a modification of the exchange clustering algorithm with improved efficiency for certain partially class-based models and a distributed version of this algorithm to efficiently obtain automatic word classifications for large vocabularies (>1 million words) using such large training corpora (>30 billion tokens). The resulting clusterings are then used in training partially class-based language models. We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score.
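For context, the classic instance of such a vocabulary partition is the two-sided class-based bigram model of Brown et al. (1992); the following is a minimal sketch of that factorization for illustration, not necessarily the exact partially class-based form investigated in this paper (where, e.g., the history may remain word-based):

$$P(w_i \mid w_{i-1}) \approx P\big(w_i \mid c(w_i)\big) \cdot P\big(c(w_i) \mid c(w_{i-1})\big)$$

where $c(w)$ denotes the equivalence class assigned to word $w$. Because class n-gram statistics are far less sparse than word n-gram statistics, such a factorization trades some modeling precision for more reliable estimates.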