ACL
2011

The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic with High Dialectal Content

The written form of Arabic, Modern Standard Arabic (MSA), differs quite a bit from the spoken dialects of Arabic, which are the true “native” languages of Arabic speakers used in daily life. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. We present the Arabic Online Commentary Dataset, a 52M-word monolingual dataset rich in dialectal content, and we describe our long-term annotation effort to identify the dialect level (and dialect itself) in each sentence of the dataset. So far, we have labeled 108K sentences, 41% of which were marked as having dialectal content. We also present experimental results on the task of automatic dialect identification, using the collected labels for training and evaluation.
Omar Zaidan, Chris Callison-Burch
Added 24 Aug 2011
Updated 24 Aug 2011
Type Journal
Year 2011
Where ACL