We investigate the tasks of general morphological tagging, diacritization, and lemmatization for Arabic. We show that, for all tasks we consider, both modeling the lexeme explicitly and retuning the weights of individual classifiers for the specific task improve performance.

1 Previous Work

Arabic has about 14 dimensions of inflection (most of them orthogonal), and in our training corpus of about 288,000 words we find 3279 complete morphological tags, with up to 100,000 possible tags. Because the number of tags is so large, morphological tagging cannot be construed as a simple classification task. Hajic (2000) is the first to use a dictionary as a source of possible morphological analyses (and hence tags) for an inflected word form, and then to redefine the tagging task as a choice among the tags proposed by the dictionary, using a log-linear model trained on specific ambiguity classes for individual morphological features. Hajic et al. (2005) implement the approach o...
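
As an illustration of the dictionary-constrained formulation attributed to Hajic (2000) above, the following minimal Python sketch restricts the choice to the tags a dictionary proposes for a word form and ranks them with a log-linear score. It is a simplified sketch, not the cited system: the toy dictionary, the feature names, the weight values, and the functions `score` and `choose_tag` are all hypothetical, and the sketch uses one shared weight table rather than models trained per ambiguity class.

```python
import math

# Hypothetical toy dictionary: word form -> candidate tags proposed for it.
# Each tag is a bundle of morphological features (names are assumptions).
TOY_DICTIONARY = {
    "ktb": [
        {"pos": "verb", "person": "3", "number": "sg"},
        {"pos": "noun", "person": "na", "number": "pl"},
    ],
}

# Hypothetical log-linear weights, one per (feature, value) pair.
# In a real system these would be learned from annotated data.
TOY_WEIGHTS = {
    ("pos", "verb"): 1.2,
    ("pos", "noun"): 0.4,
    ("number", "sg"): 0.3,
    ("number", "pl"): 0.5,
}


def score(tag, weights):
    """Unnormalized log-linear score: sum of weights of the tag's features."""
    return sum(weights.get((feat, val), 0.0) for feat, val in tag.items())


def choose_tag(word, dictionary, weights):
    """Pick the best tag among those the dictionary proposes for `word`.

    Returns (tag, probability); (None, 0.0) if the word is not in the dictionary.
    """
    candidates = dictionary.get(word, [])
    if not candidates:
        return None, 0.0
    scores = [score(tag, weights) for tag in candidates]
    # Softmax over the candidate set only, not over all possible tags.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(candidates)), key=lambda i: probs[i])
    return candidates[best], probs[best]


best_tag, best_prob = choose_tag("ktb", TOY_DICTIONARY, TOY_WEIGHTS)
```

The point of the design is that normalization runs over the handful of dictionary candidates rather than the full tag space, which is what keeps the problem tractable when the number of possible tags is very large.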
Ryan Roth, Owen Rambow, Nizar Habash, Mona T. Diab