Sciweavers

IAT
2007
IEEE

An Intelligent Web Agent to Mine Bilingual Parallel Pages via Automatic Discovery of URL Pairing Patterns

This paper describes an intelligent agent that facilitates bitext mining from the Web by automatically discovering URL pairing patterns (or keys) for retrieving parallel web pages. The linking power of a key, defined as the number of URL pairs that it can match, serves as the objective function in the search for the best set of keys, that is, the set that finds the greatest number of web page pairs within a bilingual website. Our experiments show that a best-first search approximating this optimization with an empirical threshold can recognize 98.1% of true parallel web pages and discover many irregular pairing patterns that other approaches are unlikely to find. The method requires no prior knowledge such as ad hoc heuristics, no labelled training data, and no similarity analysis of web page structure and content, all of which are commonly involved in existing approaches.
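As a rough illustration of the idea of linking power, the sketch below is not the authors' implementation. It makes several assumptions of its own: a key is taken to be the pair of differing substrings left after stripping the common prefix and suffix of two URLs, the search is a simple greedy loop standing in for the best-first search, and the function names (candidate_key, keys_with_pairs, mine_pairs) and the threshold value are hypothetical.

```python
from collections import defaultdict


def candidate_key(u1, u2):
    """Return the differing middle parts (a, b) of two URLs, or None.

    Assumption: a key is what remains after removing the longest common
    prefix and suffix. Keys with an empty side are skipped for simplicity.
    """
    if u1 == u2:
        return None
    shortest = min(len(u1), len(u2))
    p = 0
    while p < shortest and u1[p] == u2[p]:
        p += 1
    s = 0
    while s < shortest - p and u1[len(u1) - 1 - s] == u2[len(u2) - 1 - s]:
        s += 1
    a, b = u1[p:len(u1) - s], u2[p:len(u2) - s]
    return (a, b) if a and b else None


def keys_with_pairs(urls):
    """Map each candidate key to the set of URL pairs it matches."""
    table = defaultdict(set)
    ordered = sorted(urls)
    for i, u1 in enumerate(ordered):
        for u2 in ordered[i + 1:]:
            key = candidate_key(u1, u2)
            if key is not None:
                table[key].add((u1, u2))
    return table


def mine_pairs(urls, threshold=3):
    """Greedily pick keys by linking power (number of matched pairs).

    Stops when the best remaining key matches fewer than `threshold` pairs.
    The threshold value here is arbitrary, not the one used in the paper.
    """
    remaining = set(urls)
    selected = []
    while True:
        table = keys_with_pairs(remaining)
        if not table:
            break
        key, pairs = max(table.items(), key=lambda kv: (len(kv[1]), kv[0]))
        if len(pairs) < threshold:
            break
        used, accepted = set(), []
        for u1, u2 in sorted(pairs):
            # Each URL may belong to at most one accepted pair.
            if u1 not in used and u2 not in used:
                accepted.append((u1, u2))
                used.update((u1, u2))
        selected.append((key, accepted))
        remaining -= used
    return selected


# Hypothetical site layout, used only for illustration.
site_urls = [
    f"/{lang}/{page}.html"
    for lang in ("en", "zh")
    for page in ("index", "about", "news", "contact")
]
for key, pairs in mine_pairs(site_urls):
    print(key, len(pairs))
```

On a real site the quadratic comparison of all URL pairs would be too slow, so an index over URL substrings would likely be needed. The sketch is only meant to make the objective concrete: keys that match many URL pairs are preferred over keys that match few.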
Chunyu Kit, Jessica Yee Ha Ng
Added 02 Jun 2010
Updated 02 Jun 2010
Type Conference
Year 2007
Where IAT
Authors Chunyu Kit, Jessica Yee Ha Ng