Automatic Diacritization for Low-Resource Languages Using a Hybrid Word and Consonant CMM

We are interested in diacritizing Semitic languages, especially Syriac, using only diacritized texts. Previous methods have required the use of tools such as part-of-speech taggers, segmenters, morphological analyzers, and linguistic rules to produce state-of-the-art results. We present a low-resource, data-driven, and language-independent approach that uses a hybrid word- and consonant-level conditional Markov model (CMM). Our approach rivals the best previously published results in Arabic (a word error rate, or WER, of 15% with case endings), without the use of a morphological analyzer. In Syriac, we reduce the WER over a strong baseline by 30% to achieve a WER of 10.5%. We also report results for Hebrew and English.
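
As a rough illustration of the WER metric quoted in the abstract, the sketch below computes it the way it is commonly defined for diacritization: the share of words whose predicted form differs from the reference in any character. The function name `word_error_rate` and the toy strings are hypothetical and are not taken from the paper, whose exact scoring procedure may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Fraction of words whose predicted form differs from the reference.

    Assumes both strings contain the same words in the same order, so that
    word i of the hypothesis is the system's restoration of word i of the
    reference (diacritization does not insert or delete words).
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis must have the same number of words")
    if not ref_words:
        return 0.0
    # A word counts as an error if any character (including any diacritic) differs.
    errors = sum(r != h for r, h in zip(ref_words, hyp_words))
    return errors / len(ref_words)


# Made-up toy data: one of six words is restored incorrectly.
reference = "the cat sat on the mat"
hypothesis = "the cot sat on the mat"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.3f}")
```

Because each word is scored as a whole, a single wrong diacritic makes the entire word count as an error, which is why WER is typically higher than a per-character error rate on the same output.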

Robbie Haertel, Peter McClanahan, Eric K. Ringger

Computational Linguistics | Conditional Markov Model | Morphological Analyzer | NAACL 2010 | Part-of-speech Taggers

Post Info

Added	14 Feb 2011
Updated	14 Feb 2011
Type	Conference
Year	2010
Where	NAACL
Authors	Robbie Haertel, Peter McClanahan, Eric K. Ringger
