ACSW
2004

Discovering Parallel Text from the World Wide Web

A parallel corpus is a rich linguistic resource for various multilingual text management tasks, including cross-lingual text retrieval, multilingual computational linguistics, and multilingual text mining. Constructing a parallel corpus requires effective alignment of parallel documents. In this paper, we develop a parallel page identification system for identifying and aligning parallel documents from the World Wide Web. The system uses a Web spider to crawl the Web and fetch potentially parallel multilingual Web documents. Two modules determine whether a candidate document pair is parallel: a filename comparison module checks filename resemblance, and a content analysis module measures semantic similarity. An experiment conducted on a multilingual Web site shows the effectiveness of the system.
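The abstract does not say how the filename comparison module decides that two filenames resemble each other. As a rough illustration only, the sketch below assumes one common heuristic: strip language markers from each URL's filename and check whether the remainders match. The marker list `LANG_MARKERS` and the function names `normalize_filename` and `filenames_resemble` are hypothetical and are not taken from the paper.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical language markers; the abstract does not list the ones the system uses.
LANG_MARKERS = {"en", "eng", "english", "zh", "chi", "chinese", "cn", "tw", "big5", "gb"}


def normalize_filename(url: str) -> str:
    """Return the URL's filename stem, lowercased, with language markers removed."""
    stem = PurePosixPath(urlparse(url).path).stem.lower()
    # Split on common separators and drop tokens that only signal a language.
    tokens = [t for t in re.split(r"[-_.]+", stem) if t and t not in LANG_MARKERS]
    return "-".join(tokens)


def filenames_resemble(url_a: str, url_b: str) -> bool:
    """Assumed resemblance test: equal, non-empty normalized filename stems."""
    a, b = normalize_filename(url_a), normalize_filename(url_b)
    return bool(a) and a == b
```

Under this assumption, `report_en.html` and `report_zh.html` would be treated as a candidate pair, which a content analysis step could then confirm or reject.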

Jisong Chen, Rowena Chau, Chung-Hsing Yeh

ACSW 2004 | ACSW 2007 | Multilingual | Parallel Corpus | Parallel Documents

Post Info

Added	30 Oct 2010
Updated	30 Oct 2010
Type	Conference
Year	2004
Where	ACSW
Authors	Jisong Chen, Rowena Chau, Chung-Hsing Yeh