SIGIR
2005
ACM

Web-based acquisition of Japanese katakana variants

This paper describes a method of detecting Japanese Katakana variants from a large corpus. Katakana words, which are mainly used for loanwords, cause problems for information retrieval and related tasks, because transliteration creates several spelling variations, all of which can be orthographically acceptable. Previous work manually defined Katakana rewrite rules, such as ベ (be) and ヴェ (ve), for generating variants, and also manually defined the weight of each edit operation needed to turn one string into another in order to detect these variants. However, this manual approach has not been able to keep up with the ever-increasing number of loanwords and their variants. With the method proposed in this paper, the weight of each edit operation is assigned mechanically based on Web data. In experiments, it performed almost as well as a method with manually determined weights. It also achieved 98.6% recall and 86.3% precision in the task of extracting Katakana variant pairs from a 38-year corpus of Japanese newspaper articles. Categories and Subject...
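The idea of detecting variants through weighted edit operations can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the cost values, the character sets, and the threshold are all hypothetical, chosen by hand only to show how cheap operations let likely variant pairs score a small distance. In the paper, the weights are derived from Web data rather than set manually, and that step is not reproduced here.

```python
from typing import Callable, List

# Hypothetical, hand-picked costs for illustration only; the paper
# derives operation weights from Web data instead.
CHEAP_INDEL = {"ー", "・", "ッ"}          # marks that are often added or dropped
CHEAP_SUBS = {frozenset(("イ", "ィ")), frozenset(("ア", "ァ"))}


def indel_cost(ch: str) -> float:
    """Cost of inserting or deleting a single character."""
    return 0.2 if ch in CHEAP_INDEL else 1.0


def sub_cost(x: str, y: str) -> float:
    """Cost of replacing character x with character y."""
    if x == y:
        return 0.0
    return 0.2 if frozenset((x, y)) in CHEAP_SUBS else 1.0


def weighted_edit_distance(
    a: str,
    b: str,
    sub: Callable[[str, str], float] = sub_cost,
    indel: Callable[[str], float] = indel_cost,
) -> float:
    """Dynamic-programming edit distance with per-operation weights."""
    n, m = len(a), len(b)
    dist: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + indel(a[i - 1])
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + indel(b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + indel(a[i - 1]),               # delete from a
                dist[i][j - 1] + indel(b[j - 1]),               # insert into a
                dist[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),   # substitute or match
            )
    return dist[n][m]


def is_variant_pair(a: str, b: str, threshold: float = 0.5) -> bool:
    """Treat two distinct words as variants if their distance is small.

    The threshold is an arbitrary assumption for this sketch.
    """
    return a != b and weighted_edit_distance(a, b) <= threshold


if __name__ == "__main__":
    print(is_variant_pair("コンピュータ", "コンピューター"))
    print(is_variant_pair("カメラ", "ラジオ"))
```

Here コンピュータ and コンピューター differ only by a trailing long-vowel mark, so the pair falls under the threshold, while カメラ and ラジオ need three full-cost substitutions and do not. A multi-character rewrite such as ベ to ヴェ would need string-level operations, which this character-level sketch does not model.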
Takeshi Masuyama, Hiroshi Nakagawa
Added 26 Jun 2010
Updated 26 Jun 2010
Type Conference
Year 2005
Where SIGIR
Authors Takeshi Masuyama, Hiroshi Nakagawa