NAACL
2010

Automatic Diacritization for Low-Resource Languages Using a Hybrid Word and Consonant CMM

We are interested in diacritizing Semitic languages, especially Syriac, using only diacritized texts. Previous methods have required the use of tools such as part-of-speech taggers, segmenters, morphological analyzers, and linguistic rules to produce state-of-the-art results. We present a low-resource, data-driven, and language-independent approach that uses a hybrid word- and consonant-level conditional Markov model (CMM). Our approach rivals the best previously published results in Arabic, a 15% word error rate (WER) with case endings, without the use of a morphological analyzer. In Syriac, we reduce the WER over a strong baseline by 30% to achieve a WER of 10.5%. We also report results for Hebrew and English.
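The results above are reported as word error rates. As a rough illustration of that metric only, and not the authors' evaluation code, the sketch below counts a word as wrong when its predicted diacritized form differs from the reference in any way. It assumes whitespace tokenization and that diacritization leaves the number of words unchanged; the function name `word_error_rate` and the romanized sample strings are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Fraction of words whose diacritized form differs from the reference.

    Assumption: both strings tokenize on whitespace into the same number
    of words, since diacritization only adds marks to existing words.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis must have the same word count")
    if not ref_words:
        return 0.0
    # A word is an error if any character (base letter or diacritic) differs.
    errors = sum(1 for r, h in zip(ref_words, hyp_words) if r != h)
    return errors / len(ref_words)


if __name__ == "__main__":
    # Hypothetical romanized example: the second word has the wrong final vowel.
    ref = "kataba alwaladu darsan"
    hyp = "kataba alwalada darsan"
    print(f"WER = {word_error_rate(ref, hyp):.3f}")
```

Under this definition a single wrong mark makes the whole word count as an error, so the rate is sensitive to word-final marks such as the case endings mentioned for Arabic.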
Robbie Haertel, Peter McClanahan, Eric K. Ringger
Added 14 Feb 2011
Updated 14 Feb 2011
Type Journal
Year 2010
Where NAACL