In this paper we address issues related to building a large-scale Chinese corpus. We try to answer four questions: (i) how to speed up annotation, (ii) how to maintain high annota...
A new statistical method called "bilingual chunking" for structure alignment is proposed. Different with the existing approaches which align hierarchical structures like...
Wei Wang, Ming Zhou, Jin-Xia Huang, Changning Huan...
In some contexts, well-formed natural language cannot be expected as input to information or communication systems. In these contexts, the use of grammar-independent input (sequen...
Statistical methods for PP attachment fall into two classes according to the training material used: first, unsupervised methods trained on raw text corpora and second, supervised...
We describe a method for generating sentences from "keywords" or "headwords". This method consists of two main parts, candidate-text construction and evaluatio...
This paper describes a project tagging a spontaneous speech corpus with morphological information such as word segmentation and parts-ofspeech. We use a morphological analysis sys...
The paper presents a statistical approach to automatic building of translation lexicons from parallel corpora. We briefly describe the pre-processing steps, a baseline iterative m...
Syllable-to-word (STW) conversion is important in Chinese phonetic input methods and speech recognition. There are two major problems in the STW conversion: (1) resolving the ambi...
This paper proposes a multi-dimensional framework for classifying text documents. In this framework, the concept of multidimensional category model is introduced for representing ...