Tokenization: Returning to a Long Solved Problem - A Survey, Contrastive Experiment, Recommendations, and Toolkit -

We examine some of the frequently disregarded subtleties of tokenization in Penn Treebank style, and present a new rule-based preprocessing toolkit that not only reproduces the Treebank tokenization with unmatched accuracy, but also maintains exact stand-off pointers to the original text and allows flexible configuration to diverse use cases (e.g. to genre- or domain-specific idiosyncrasies).
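
To make the idea of stand-off pointers concrete, here is a minimal sketch in Python. It is not the authors' toolkit: the function name `tokenize_with_offsets` and the handful of regular-expression rules are assumptions made for illustration, covering only a tiny subset of Penn Treebank-style conventions (splitting clitics such as "n't" and "'s", separating punctuation). Each token is returned with the character span it occupies in the untouched input, so the original text can always be recovered from the offsets.

```python
import re
from typing import List, Tuple

# Illustrative rules only (hypothetical, not the toolkit's rule set):
# a very small subset of Penn Treebank-style tokenization conventions.
TOKEN_RE = re.compile(
    r"""
    \w+(?=n't\b)              # stem before a negative clitic: "Do" in "Don't"
  | n't\b                     # the negative clitic itself
  | '(?:s|re|ve|ll|d|m)\b     # other common clitics: 's, 're, 've, 'll, 'd, 'm
  | \w+(?:-\w+)*              # words, keeping hyphenated compounds together
  | [^\w\s]                   # any other single non-space character
    """,
    re.VERBOSE | re.IGNORECASE,
)


def tokenize_with_offsets(text: str) -> List[Tuple[str, int, int]]:
    """Return (token, start, end) triples; text[start:end] equals the token."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    sample = "Don't panic, it's state-of-the-art."
    for token, start, end in tokenize_with_offsets(sample):
        # The offsets point back into the unmodified input string.
        print(f"{token!r:20} [{start}:{end}]")
```

Because the input string is never rewritten, downstream annotations can refer to character spans rather than to token indices, which is the property the abstract describes as stand-off pointers.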

Rebecca Dridan, Stephan Oepen

ACL 2012 | Computational Linguistics | Penn Treebank | Subtleties | Unmatched Accuracy |

Post Info

Added	29 Sep 2012
Updated	29 Sep 2012
Type	Journal
Year	2012
Where	ACL
Authors	Rebecca Dridan, Stephan Oepen
