Urdu Word Segmentation

Word segmentation is an essential first step in almost all NLP applications, since the initial phase requires tokenizing the input into words. Urdu is among the Asian languages that face a word segmentation challenge. However, unlike other Asian languages, word segmentation in Urdu involves not only space omission errors but also space insertion errors. This paper discusses how orthographic and linguistic features of Urdu trigger these two problems. It also discusses the work that has been done on tokenizing input text. We employ a hybrid solution that performs n-gram ranking on top of a rule-based maximum matching heuristic. Our best technique gives an error detection of 85.8% and an overall accuracy of 95.8%. Further issues and possible future directions are also discussed.
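
To make the hybrid idea in the abstract concrete, here is a minimal sketch of one way to combine a maximum matching heuristic with n-gram ranking. It is not the authors' implementation. The lexicon, the training sentences, the function names, the add-one smoothing, and the use of English placeholder words instead of Urdu script are all assumptions made for illustration. The sketch only addresses the space omission case (recovering word boundaries in a string with no spaces); space insertion errors are not handled.

```python
import math
from collections import Counter

# Hypothetical toy data: English placeholders stand in for Urdu words.
LEXICON = {"we", "are", "no", "now", "here", "where"}
CORPUS = [["we", "are", "now", "here"], ["we", "are", "here"]]


def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    max_len = max(map(len, lexicon))
    results = []

    def extend(start, words):
        if start == len(text):
            results.append(list(words))
            return
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            piece = text[start:end]
            if piece in lexicon:
                words.append(piece)
                extend(end, words)
                words.pop()

    extend(0, [])
    return results


def fewest_words(candidates):
    """Maximum matching heuristic (assumed form): keep the splits with the fewest words."""
    if not candidates:
        return []
    best = min(len(c) for c in candidates)
    return [c for c in candidates if len(c) == best]


def count_ngrams(sentences):
    """Collect unigram and bigram counts, with "<s>" as a sentence-start marker."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        prev = "<s>"
        unigrams[prev] += 1
        for word in sentence:
            unigrams[word] += 1
            bigrams[(prev, word)] += 1
            prev = word
    return unigrams, bigrams


def bigram_log_score(words, unigrams, bigrams, vocab_size):
    """Log-probability of a word sequence under an add-one smoothed bigram model."""
    score, prev = 0.0, "<s>"
    for word in words:
        score += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
        prev = word
    return score


def segment(text, lexicon, unigrams, bigrams):
    """Rank the maximum matching candidates by bigram score; None if no split exists."""
    candidates = fewest_words(segmentations(text, lexicon))
    if not candidates:
        return None
    vocab_size = len(lexicon) + 1  # lexicon words plus the "<s>" marker
    return max(candidates, key=lambda ws: bigram_log_score(ws, unigrams, bigrams, vocab_size))


UNIGRAMS, BIGRAMS = count_ngrams(CORPUS)
result = segment("wearenowhere", LEXICON, UNIGRAMS, BIGRAMS)
print(result)
```

In this toy input the lexicon allows two four-word splits ("we are now here" and "we are no where"), so the length heuristic alone cannot decide between them and the bigram counts break the tie. Exhaustive enumeration is used here only for brevity; it grows quickly with input length, and a practical system would prune candidates.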

Nadir Durrani, Sarmad Hussain

Computational Linguistics | NAACL 2010 | Urdu | Word Segmentation | Word Segmentation Challenge

Added	14 Feb 2011
Updated	14 Feb 2011
Type	Journal
Year	2010
Where	NAACL
Authors	Nadir Durrani, Sarmad Hussain
