Low-Density Language Bootstrapping: the Case of Tajiki Persian

Low-density languages raise difficulties for standard approaches to natural language processing that depend on large online corpora. Using Persian as a case study, we propose a novel method for bootstrapping MT capability for a low-density language when it is closely related to a higher-density variant. Tajiki Persian is a low-density language written in the Cyrillic alphabet, while Iranian Persian (Farsi) is written in an extended version of the Arabic script and has many computational resources available. Despite the orthographic differences, the literary written forms of the two languages are almost identical. The paper describes the development of a comprehensive finite-state transducer that converts Tajik text to Farsi script; the resulting transliterated document is then run through an existing Persian-to-English MT system. Because of divergences that arise in mapping between the two writing systems, as well as phonological and lexical distinctions, the system uses contextual cues (such as the posi...
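
As a rough illustration of script conversion that depends on context, the sketch below transliterates a few Tajik Cyrillic letters into Perso-Arabic script in plain Python. It is not the authors' finite-state transducer. The function names, the tiny letter table, and the choice of word-initial versus non-initial position as the contextual cue are all assumptions made for this example, and letters outside the table are passed through unchanged.

```python
# Hypothetical, heavily simplified Tajik Cyrillic -> Perso-Arabic mapping.
# Only a handful of letters are covered; this is NOT the paper's transducer.
CONSONANTS = {
    "б": "\u0628",  # beh
    "т": "\u062a",  # teh
    "д": "\u062f",  # dal
    "н": "\u0646",  # noon
    "р": "\u0631",  # reh
    "м": "\u0645",  # meem
}

# Vowel output depends on position: (word-initial form, form elsewhere).
VOWELS = {
    "о": ("\u0622", "\u0627"),  # long a: alef madda initially, plain alef elsewhere
    "а": ("\u0627", ""),        # short a: alef initially, unwritten elsewhere
}


def transliterate_word(word: str) -> str:
    """Map one word letter by letter, using position as the only context."""
    out = []
    for i, ch in enumerate(word.lower()):
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        elif ch in VOWELS:
            initial, elsewhere = VOWELS[ch]
            out.append(initial if i == 0 else elsewhere)
        else:
            out.append(ch)  # unknown characters pass through unchanged
    return "".join(out)


def transliterate(text: str) -> str:
    """Transliterate whitespace-separated words independently."""
    return " ".join(transliterate_word(w) for w in text.split())


if __name__ == "__main__":
    for tajik in ("об", "нон", "модар"):
        print(tajik, "->", transliterate(tajik))
```

Under these simplified rules, "об" (water) maps to alef madda plus beh, and "модар" (mother) drops its medial short vowel. A real system would need a much larger rule set and would have to resolve ambiguous letters that this sketch ignores.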

Karine Megerdoomian, Dan Parvaz

Education | Languages Raise Difficulties | Low-density Language | LREC 2008 | Natural Language Processing |

Post Info

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2008
Where	LREC
Authors	Karine Megerdoomian, Dan Parvaz
