Parallel web pages are important source of training data for statistical machine translation. In this paper, we present a new approach to sentence alignment on parallel web pages....
This paper presents a new web mining scheme for parallel data acquisition. Based on the Document Object Model (DOM), a web page is represented as a DOM tree. Then a DOM tree align...
The quality of a statistical machine translation (SMT) system is heavily dependent upon the amount of parallel sentences used in training. In recent years, there have been several...
Mining bilingual data (including bilingual sentences and terms1 ) from the Web can benefit many NLP applications, such as machine translation and cross language information retrie...
Long Jiang, Shiquan Yang, Ming Zhou, Xiaohua Liu, ...
This paper1 presents an empirical approach to mining parallel corpora. Conventional approaches use a readily available collection of comparable, nonparallel corpora to extract par...