ACSW
2004

Discovering Parallel Text from the World Wide Web

A parallel corpus is a rich linguistic resource for various multilingual text management tasks, including cross-lingual text retrieval, multilingual computational linguistics, and multilingual text mining. Constructing a parallel corpus requires effective alignment of parallel documents. In this paper, we develop a parallel page identification system for identifying and aligning parallel documents from the World Wide Web. The system uses a Web spider to crawl the Web and fetch potentially parallel multilingual Web documents. Two modules determine whether a candidate document pair is parallel. First, a filename comparison module checks how closely the filenames resemble each other. Second, a content analysis module measures the semantic similarity of the documents. An experiment conducted on a multilingual Web site shows the effectiveness of the system.
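The abstract does not say how filename resemblance is computed, so the following is only a minimal sketch of one plausible approach, not the authors' method. It assumes that parallel pages often share a filename apart from a language marker (for example `about_en.html` and `about_zh.html`). The marker list `LANG_MARKERS` and the function names `normalized_stem` and `filename_resemblance` are hypothetical, and the `example.org` URLs are placeholders.

```python
import re
from difflib import SequenceMatcher
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical language markers; this list is an assumption, not from the paper.
LANG_MARKERS = {"en", "eng", "english", "zh", "cn", "chi", "chinese", "big5", "gb"}


def normalized_stem(url: str) -> str:
    """Return the lowercase filename stem of a URL with language markers removed."""
    stem = PurePosixPath(urlparse(url).path).stem.lower()
    # Split on anything that is not a letter or digit, e.g. "_" or "-".
    tokens = [t for t in re.split(r"[^a-z0-9]+", stem) if t]
    return "-".join(t for t in tokens if t not in LANG_MARKERS)


def filename_resemblance(url_a: str, url_b: str) -> float:
    """Score in [0, 1] for how alike two filenames are, ignoring language markers."""
    a, b = normalized_stem(url_a), normalized_stem(url_b)
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    pair = ("http://example.org/about_en.html", "http://example.org/about_zh.html")
    print(filename_resemblance(*pair))
```

In a pipeline like the one described, a score of this kind would presumably serve as a cheap first filter, with only high-scoring pairs passed on to the more expensive content analysis; the threshold to use is left open here.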
Jisong Chen, Rowena Chau, Chung-Hsing Yeh
Added 30 Oct 2010
Updated 30 Oct 2010
Type Conference
Year 2004
Where ACSW
Authors Jisong Chen, Rowena Chau, Chung-Hsing Yeh