Focused web crawling in the acquisition of comparable corpora

CLIR resources, such as dictionaries and parallel corpora, are scarce for special domains. Obtaining comparable corpora automatically for such domains could be an answer to this problem. The Web, with its vast volumes of data, offers a natural source for this. We experimented with focused crawling as a means to acquire comparable corpora in the genomics domain. The acquired corpora were used to statistically translate domain-specific words. The same words were also translated using a high-quality but non-genomics-related parallel corpus, which fared considerably worse. We also evaluated our system with standard IR experiments, combining statistical translation using the Web corpora with dictionary-based translation. The results showed improvement over pure dictionary-based translation. Therefore, mining the Web for comparable corpora seems promising.
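
The abstract does not describe the crawler's internals, so the following is only a minimal sketch of the general idea of focused (best-first) crawling, not the authors' system. It assumes a hypothetical in-memory "web" (`TOY_WEB`), a hand-picked topic vocabulary (`VOCAB`), and a simple token-overlap relevance score; the names `relevance`, `focused_crawl`, `threshold`, and `max_pages` are all illustrative choices.

```python
import heapq


def relevance(text, vocabulary):
    """Fraction of tokens in `text` that belong to the topic vocabulary."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in vocabulary) / len(tokens)


def focused_crawl(web, seeds, vocabulary, threshold=0.2, max_pages=100):
    """Best-first crawl over `web`, a dict: url -> (text, outgoing links).

    Links found on a relevant page are queued with that page's score as
    their priority. Pages scoring below `threshold` are neither kept
    nor expanded, which keeps the crawl on topic.
    """
    frontier = [(-1.0, url) for url in seeds]  # negated: heapq is a min-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    corpus = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue  # dangling link in the toy graph
        fetched += 1
        text, links = web[url]
        score = relevance(text, vocabulary)
        if score < threshold:
            continue  # off-topic page: drop it and do not follow its links
        corpus.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return corpus


# Hypothetical toy data standing in for real pages (assumption, not from the paper).
TOY_WEB = {
    "a": ("gene protein expression in cells", ["b", "c"]),
    "b": ("football scores and league tables", ["d"]),
    "c": ("protein sequence and gene mutation data", ["e"]),
    "d": ("gene therapy news", []),
    "e": ("weather forecast for today", []),
}
VOCAB = {"gene", "protein", "sequence", "mutation", "expression"}

crawled_urls = [url for url, _ in focused_crawl(TOY_WEB, ["a"], VOCAB)]
```

In this toy graph, page "d" is on topic but is reachable only through the off-topic page "b", so it is never visited. That trade-off (precision over recall) is typical of threshold-based focused crawling; a real crawler would also need fetching, parsing, language identification, and politeness handling, none of which are shown here.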

Tuomas Talvensaari, Ari Pirkola, Kalervo Järvelin, Martti Juhola, Jorma Laurikkala

CLIR Resources | Comparable Corpora | Dictionary-based Translation | IR 2008 | Natural Language Processing |

» Focused Crawling Using Latent Semantic Indexing An Application for Vertical Search Engine...

» Combining Text and Link Analysis for Focused Crawling

» Reinforcement Learning with Classifier Selection for Focused Crawling

» xCrawl A HighRecall Crawling Method for Web Mining

» Crawling Deep Web Using a New Set Covering Algorithm

» Topical Crawling for Business Intelligence

» Collecting paraphrase corpora from volunteer contributors

» Automatic Acquisition of Lexical Formality

Post Info

Added	12 Dec 2010
Updated	12 Dec 2010
Type	Journal
Year	2008
Where	IR
Authors	Tuomas Talvensaari, Ari Pirkola, Kalervo Järvelin, Martti Juhola, Jorma Laurikkala
