A Trainable Tokenizer, solution for multilingual texts and compound expression tokenization

Tokenization is one of the first steps in almost any text processing task. It is not generally regarded as a challenging problem for monolingual English systems, but its complexity increases rapidly for systems that must handle several languages. This article proposes a supervised learning approach to tokenization. The method is based on a character-transition representation, which allows compound expressions to be recognized as single tokens. Compound tokens are identified independently of the character that creates the expression. The method automatically learns tokenization rules from a pre-tokenized corpus. Results obtained with the trainable system show a statistically significant improvement, for both Romanian and English, over a baseline system that tokenizes text at every non-alphanumeric character.
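
The contrast between the baseline and a trainable tokenizer can be illustrated with a small sketch. It is not the paper's implementation: the abstract does not specify the representation, so the coarse character classes (`L`, `D`, `P`), the four-character context window, the majority-vote rule table, and all function names below are assumptions made for illustration. The sketch also assumes that whitespace always separates tokens and that the baseline keeps each punctuation mark as its own token.

```python
from collections import Counter, defaultdict


def baseline_tokenize(text):
    """Baseline: cut at every non-alphanumeric character.
    Assumption: punctuation marks are kept as one-character tokens."""
    tokens, current = [], ""
    for ch in text:
        if ch.isalnum():
            current += ch
            continue
        if current:
            tokens.append(current)
            current = ""
        if not ch.isspace():
            tokens.append(ch)
    if current:
        tokens.append(current)
    return tokens


def char_class(ch):
    """Hypothetical coarse classes: L(etter), D(igit), P (anything else)."""
    if ch.isalpha():
        return "L"
    if ch.isdigit():
        return "D"
    return "P"


def contexts(chunk):
    """Yield (offset, context) for every position between two adjacent
    characters of a whitespace-free chunk. The context holds the classes
    of the two characters before and the two after the position."""
    classes = ["^", "^"] + [char_class(c) for c in chunk] + ["$"]
    for i in range(1, len(chunk)):
        yield i, tuple(classes[i:i + 4])


def train(corpus):
    """Learn split/no-split rules from pre-tokenized chunks.
    Each item is the token list of one whitespace-free chunk."""
    votes = defaultdict(Counter)
    for tokens in corpus:
        chunk = "".join(tokens)
        cuts, pos = set(), 0
        for tok in tokens[:-1]:
            pos += len(tok)
            cuts.add(pos)
        for i, ctx in contexts(chunk):
            votes[ctx][i in cuts] += 1
    # Keep the majority decision for each observed context.
    return {ctx: c[True] > c[False] for ctx, c in votes.items()}


def tokenize(text, rules):
    """Apply learned rules; unseen contexts fall back to the baseline cut."""
    tokens = []
    for chunk in text.split():
        start = 0
        for i, ctx in contexts(chunk):
            fallback = not (chunk[i - 1].isalnum() and chunk[i].isalnum())
            if rules.get(ctx, fallback):
                tokens.append(chunk[start:i])
                start = i
        tokens.append(chunk[start:])
    return tokens


# Toy pre-tokenized corpus (invented for this example).
rules = train([["state-of-the-art"], ["end", "."], ["(", "note", ")"]])

print(baseline_tokenize("A (well-known) trick."))
# ['A', '(', 'well', '-', 'known', ')', 'trick', '.']
print(tokenize("A (well-known) trick.", rules))
# ['A', '(', 'well-known', ')', 'trick', '.']
```

Because the rules in this sketch are keyed on character classes rather than on literal characters, a compound joined by a different symbol, such as "and/or", is also kept whole even though only hyphenated compounds appear in the toy training data.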

Oana Frunza

Education | LREC 2008 | Text Processing Task | Tokenization | Tokenization Task |

Post Info

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2008
Where	LREC
Authors	Oana Frunza
