Part-of-speech tagging of Modern Hebrew text



Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a Part-of-Speech (POS) category. Semitic words may be ambiguous with regard to their segmentation as well as to the POS tags assigned to each segment. When designing POS taggers for Semitic languages, a major architectural decision concerns the choice of the atomic input tokens (terminal symbols). If the tokenization is at the word level, the output tags must be complex and represent both the segmentation of the word and the POS tag assigned to each word segment. If the tokenization is at the segment level, the input itself must encode the alternative segmentations of the words, while the output consists of standard POS tags. Comparing these two alternatives is not trivial, as the choice between them may have global effects on the grammatical model. Moreover, intermediate levels of tokenization between these two extremes are conceivable and, as we aim to show, beneficial. To...
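The contrast between the two tokenization levels can be made concrete with a small sketch. It is an illustration only, not the authors' tagger: the transliterated word "bclm", its three analyses, the tag names (PREP, NOUN, PRON), and both function names are assumptions chosen for the example, and it assumes the segments concatenate exactly to the surface word, which does not always hold in Hebrew.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, str]      # (surface segment, POS tag)
Analysis = List[Segment]       # one segmentation of a word, with a tag per segment

# Hypothetical analyses for one ambiguous transliterated word (illustrative only).
ANALYSES: Dict[str, List[Analysis]] = {
    "bclm": [
        [("b", "PREP"), ("cl", "NOUN"), ("m", "PRON")],
        [("bcl", "NOUN"), ("m", "PRON")],
        [("b", "PREP"), ("clm", "NOUN")],
    ],
}

def word_level_tags(word: str) -> List[str]:
    """Word-level view: the word is the token, and each candidate tag is
    complex, encoding both the segmentation and the POS of every segment."""
    return ["+".join(f"{seg}/{tag}" for seg, tag in analysis)
            for analysis in ANALYSES[word]]

def segment_level_lattice(word: str) -> List[Tuple[int, int, str, str]]:
    """Segment-level view: the input is a lattice of alternative segments.
    Each edge is (start offset, end offset, segment, simple POS tag)."""
    edges = set()
    for analysis in ANALYSES[word]:
        offset = 0
        for seg, tag in analysis:
            edges.add((offset, offset + len(seg), seg, tag))
            offset += len(seg)
    return sorted(edges)

if __name__ == "__main__":
    for tag in word_level_tags("bclm"):
        print("complex tag:", tag)
    for edge in segment_level_lattice("bclm"):
        print("lattice edge:", edge)
```

In the word-level view the ambiguity lives entirely in the output tag set, which grows with every segmentation. In the segment-level view the tags stay simple, but a tagger has to choose a path through the lattice rather than label a fixed token sequence.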

Roy Bar-Haim, Khalil Sima'an, Yoad Winter


NLE 2008 | POS Tagger | Semitic | Tokenization |


Post Info

Added	14 Dec 2010
Updated	14 Dec 2010
Type	Journal
Year	2008
Where	NLE
Authors	Roy Bar-Haim, Khalil Sima'an, Yoad Winter

