CLIN
2003

Methods for the Extraction of Hungarian Multi-Word Lexemes

This paper describes an experiment on extracting Hungarian multi-word lexemes from a corpus using statistical methods. Corpus preparation (the addition of POS tags and stems) was done automatically. From the corpus, verb+noun+casemark patterns were extracted as collocation candidates. Evaluation shows that the statistical methods used by Villada Moirón (2004a) to identify Dutch V + PP collocations can also be applied to the Hungarian data. Some collocation types (such as verbal arguments) require special extraction methods, as explained in the evaluation section. Finally, we suggest that the extraction process can be further improved by blending statistical techniques with rule-based and dictionary-based methods.
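As a rough illustration of the kind of pipeline the abstract describes, the sketch below ranks verb+noun+casemark candidates by an association score. The abstract does not name the statistical measures used, so pointwise mutual information (PMI) is an assumption chosen only because it is a common collocation measure. The function name `rank_by_pmi`, the `min_count` threshold, the case labels, and the toy triples are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def rank_by_pmi(triples, min_count=2):
    """Rank (verb_stem, noun_stem, case) candidates by PMI.

    PMI is an assumed measure for illustration; the paper's actual
    statistics are not specified in the abstract.
    """
    triple_counts = Counter(triples)
    verb_counts = Counter(verb for verb, _, _ in triples)
    # Treat the noun together with its case mark as one unit.
    arg_counts = Counter((noun, case) for _, noun, case in triples)
    total = len(triples)

    scored = []
    for (verb, noun, case), n in triple_counts.items():
        if n < min_count:
            continue  # skip rare candidates, whose PMI is unreliable
        # PMI = log2( P(verb, arg) / (P(verb) * P(arg)) )
        pmi = math.log2(n * total / (verb_counts[verb] * arg_counts[(noun, case)]))
        scored.append(((verb, noun, case), pmi))

    # Highest association first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


# Invented toy data standing in for candidates pulled from a tagged,
# stemmed corpus; the counts are not from the paper.
toy_triples = (
    [("vesz", "rész", "ACC")] * 3
    + [("vesz", "könyv", "ACC")]
    + [("olvas", "könyv", "ACC")] * 3
    + [("lát", "könyv", "ACC")]
)

ranked = rank_by_pmi(toy_triples)
for triple, score in ranked:
    print(triple, round(score, 3))
```

A frequency threshold like `min_count` is a common guard because PMI tends to overrate very rare pairs; the value used here is arbitrary.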
Added 31 Oct 2010
Updated 31 Oct 2010
Type Conference
Year 2003
Where CLIN
Authors Balázs Kis, Begoña Villada, Gosse Bouma, Gábor Ugray, Tamás Bíró, Gábor Pohl, John Nerbonne