In this paper, we report on the design of a part-of-speech tagset for Wolof and on the creation of a semi-automatically annotated gold standard. The main motivation for this resource is to obtain data for training automatic taggers with machine learning approaches. Hence, we take machine learning considerations into account during tagset design and present training experiments as part of this paper. The best automatic tagger achieves an accuracy of 95.2% in cross-validation experiments. We also wanted to create a basis for experimenting with annotation projection techniques, which exploit parallel corpora. For this reason, we use a part of the Bible as the gold standard corpus, since sentence-aligned parallel versions of it are easy to obtain in many languages.
Cheikh M. Bamba Dione, Jonas Kuhn, Sina Zarrieß