Automatic Diacritization for Low-Resource Languages Using a Hybrid Word and Consonant CMM



We are interested in diacritizing Semitic languages, especially Syriac, using only diacritized texts. Previous methods have required the use of tools such as part-of-speech taggers, segmenters, morphological analyzers, and linguistic rules to produce state-of-the-art results. We present a low-resource, data-driven, and language-independent approach that uses a hybrid word- and consonant-level conditional Markov model. Our approach rivals the best previously published results in Arabic (15% WER with case endings), without the use of a morphological analyzer. In Syriac, we reduce the WER over a strong baseline by 30% to achieve a WER of 10.5%. We also report results for Hebrew and English.
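The results above are reported as word error rate (WER). As a minimal sketch, assuming WER here means the fraction of words whose predicted diacritized form differs from the reference in any way, and assuming the prediction keeps the same tokenization as the reference, it could be computed as follows. The function names and the transliterated toy words are hypothetical and not taken from the paper. The abstract does not say whether the 30% reduction is relative or absolute; the helper below computes a relative reduction and is shown with made-up numbers only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Fraction of words whose diacritized form differs from the reference.

    Assumes whitespace tokenization and that diacritization never changes
    the number of words, so the two sequences align one-to-one.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis must have the same word count")
    if not ref_words:
        return 0.0
    # A word counts as an error if any diacritic in it is wrong.
    errors = sum(r != h for r, h in zip(ref_words, hyp_words))
    return errors / len(ref_words)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of a system over a baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Toy example with transliterated, hypothetical words: one of two words is wrong.
print(word_error_rate("kataba alwaladu", "kutiba alwaladu"))

# Made-up numbers, not results from the paper.
print(relative_reduction(0.20, 0.14))
```

Scoring whole words rather than individual diacritics makes the metric strict: a single wrong vowel mark counts the entire word as an error.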

Robbie Haertel, Peter McClanahan, Eric K. Ringger


Computational Linguistics | Conditional Markov Model | Morphological Analyzer | NAACL 2010 | Part-of-speech Taggers


Post Info

Added	14 Feb 2011
Updated	14 Feb 2011
Type	Journal
Year	2010
Where	NAACL
Authors	Robbie Haertel, Peter McClanahan, Eric K. Ringger

