Sciweavers

LREC
2010

Adapting Chinese Word Segmentation for Machine Translation Based on Short Units

14 years 28 days ago
Adapting Chinese Word Segmentation for Machine Translation Based on Short Units
In Chinese texts, words composed of single or multiple characters are not separated by spaces, unlike most western languages. Therefore Chinese word segmentation is considered an important first step in machine translation (MT) and its performance impacts MT results. Many factors affect Chinese word segmentations, including the segmentation standards and segmentation strategies. The performance of a corpus-based word segmentation model depends heavily on the quality and the segmentation standard of the training corpora. However, we observed that existing manually annotated Chinese corpora tend to have low segmentation granularity and provide poor morphological information due to the present segmentation standards. In this paper, we introduce a short-unit standard of Chinese word segmentation, which is particularly suitable for machine translation, and propose a semi-automatic method of transforming the existing corpora into the ones that can satisfy our standards. We evaluate the usef...
Yiou Wang, Kiyotaka Uchimoto, Jun'ichi Kazama, Can
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2010
Where LREC
Authors Yiou Wang, Kiyotaka Uchimoto, Jun'ichi Kazama, Canasai Kruengkrai, Kentaro Torisawa
Comments (0)