No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity


Download www.umiacs.umd.edu

This work explores the problem of cross-lingual pairwise similarity, where the task is to extract similar pairs of documents across two different languages. Solutions to this problem are of general interest for text mining in the multilingual context and have specific applications in statistical machine translation. Our approach takes advantage of cross-language information retrieval (CLIR) techniques to project feature vectors from one language into another, and then uses locality-sensitive hashing (LSH) to extract similar pairs. We show that effective cross-lingual pairwise similarity requires working with similarity thresholds that are much lower than in typical monolingual applications, making the problem quite challenging. We present a parallel, scalable MapReduce implementation of the sort-based sliding window algorithm, which is compared to a brute-force approach on German and English Wikipedia collections. Our central finding can be summarized as "no free lunch": there ...
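The abstract names two ingredients, LSH signatures and a sort-based sliding window, without showing them. The following is a minimal single-machine sketch of that general idea, not the authors' MapReduce implementation. It assumes the document vectors have already been projected into one shared feature space, uses random-projection bit signatures compared by Hamming distance as one common LSH choice, and every name and parameter here (`make_planes`, `signature`, `sliding_window_pairs`, `num_perms`, `window`, `max_dist`) is hypothetical.

```python
import random


def make_planes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normals) in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """Random-projection signature: one bit per hyperplane (sign of the dot product)."""
    return tuple(
        1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
        for plane in planes
    )


def hamming(a, b):
    """Number of bit positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def sliding_window_pairs(sigs, num_perms, window, max_dist, seed=0):
    """Sort-based sliding window over bit signatures (illustrative sketch).

    For each of num_perms random bit permutations, sort the documents by
    their permuted signature and compare each document only with the next
    `window` documents in that order. Pairs within max_dist Hamming
    distance are reported as candidates.
    """
    rng = random.Random(seed)
    n_bits = len(next(iter(sigs.values())))
    pairs = set()
    for _ in range(num_perms):
        perm = list(range(n_bits))
        rng.shuffle(perm)
        ordered = sorted(sigs, key=lambda d: tuple(sigs[d][i] for i in perm))
        for i, a in enumerate(ordered):
            for b in ordered[i + 1 : i + 1 + window]:
                if hamming(sigs[a], sigs[b]) <= max_dist:
                    pairs.add((min(a, b), max(a, b)))
    return pairs
```

In a sketch like this, a brute-force baseline would simply compare every pair of signatures (or the original vectors) directly. The window size and number of permutations trade running time against the chance of missing a similar pair, which is the kind of trade-off the abstract's "no free lunch" remark points at.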

Ferhan Ture, Tamer Elsayed, Jimmy J. Lin


English Wikipedia | Information Technology | SIGIR 2011 | Sliding Window Algorithm | Statistical Machine Translation |


Post Info

Added	17 Sep 2011
Updated	17 Sep 2011
Type	Journal
Year	2011
Where	SIGIR
Authors	Ferhan Ture, Tamer Elsayed, Jimmy J. Lin
