Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation


Download papers.ldc.upenn.edu

The Arabic Treebank (ATB), released by the Linguistic Data Consortium, contains multiple annotation files for each source file, due in part to the role of diacritic inclusion in the annotation process. The data is made available in both "vocalized" and "unvocalized" forms, with and without the diacritic marks, respectively. Much parsing work with the ATB has used the unvocalized form, on the basis that it more closely represents the "real-world" situation. We point out some problems with this usage of the unvocalized data and explain why the unvocalized form does not in fact represent "real-world" data. This is due to some aspects of the treebank annotation that to our knowledge have never before been published.
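As a rough illustration of the terminology only, the difference between a "vocalized" and an "unvocalized" token can be sketched as removing the diacritic marks from a string. The sketch below assumes Unicode Arabic text and a hypothetical helper name, `strip_diacritics`; the ATB files themselves may use a different encoding (such as a transliteration), and the abstract's point is precisely that the treebank's unvocalized form should not be assumed to be equivalent to naturally occurring undiacritized text.

```python
# Illustrative sketch only: not the procedure used to build the ATB files.
# Assumption: input is Unicode Arabic script, and "diacritics" means the
# marks U+064B..U+0652 (tanween, short vowels, shadda, sukun) plus the
# superscript alef U+0670.

def is_diacritic(ch: str) -> bool:
    """Return True if ch is one of the Arabic diacritic marks listed above."""
    return "\u064b" <= ch <= "\u0652" or ch == "\u0670"


def strip_diacritics(text: str) -> str:
    """Return text with the Arabic diacritic marks removed."""
    return "".join(ch for ch in text if not is_diacritic(ch))


if __name__ == "__main__":
    # kaf + fatha, ta + fatha, ba + fatha (a fully diacritized three-letter word)
    vocalized = "\u0643\u064e\u062a\u064e\u0628\u064e"
    unvocalized = strip_diacritics(vocalized)
    # Code points before and after, to make the removed marks visible.
    print([hex(ord(c)) for c in vocalized])
    print([hex(ord(c)) for c in unvocalized])
```

Stripping marks character by character is the simplest possible reading of "without the diacritic marks"; it says nothing about tokenization or other annotation decisions that may differ between the two forms.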

Mohamed Maamouri, Seth Kulick, Ann Bies


Annotation | Education | Linguistic Data Consortium | LREC 2008 | Multiple Annotation Files


Post Info

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2008
Where	LREC
Authors	Mohamed Maamouri, Seth Kulick, Ann Bies
