Pairwise Document Similarity in Large Collections with MapReduce

This paper presents a MapReduce algorithm for computing pairwise document similarity in large document collections. MapReduce is an attractive framework because it allows us to decompose the inner products involved in computing document similarity into separate multiplication and summation stages in a way that is well matched to efficient disk access patterns across several machines. On a collection consisting of approximately 900,000 newswire articles, our algorithm exhibits linear growth in running time and space in terms of the number of documents.
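The decomposition described in the abstract can be illustrated with a small in-memory sketch. This is not the authors' implementation and does not use a real MapReduce framework. It only assumes that each document is a dictionary of term weights and that similarity is the inner product of those weights. The function names (`build_postings`, `map_products`, `reduce_sums`, `pairwise_similarity`) and the toy weights are hypothetical, chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(docs):
    """Group term weights by term: term -> list of (doc_id, weight)."""
    postings = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            postings[term].append((doc_id, weight))
    return postings


def map_products(postings):
    """Multiplication stage: emit one partial product per document pair per shared term."""
    for plist in postings.values():
        # Sorting keeps each pair key in a consistent (smaller_id, larger_id) order.
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            yield (d1, d2), w1 * w2


def reduce_sums(pairs):
    """Summation stage: add the partial products that share a document pair."""
    sims = defaultdict(float)
    for key, product in pairs:
        sims[key] += product
    return dict(sims)


def pairwise_similarity(docs):
    """Run the stages in sequence (a real job would shuffle between map and reduce)."""
    return reduce_sums(map_products(build_postings(docs)))


# Toy collection with made-up term weights (assumption, for illustration only).
docs = {
    "d1": {"a": 2, "b": 1},
    "d2": {"a": 1, "c": 3},
    "d3": {"b": 4, "c": 1},
}
sims = pairwise_similarity(docs)
```

Only document pairs that share at least one term ever produce a key, so pairs with zero similarity cost nothing in this sketch.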

Tamer Elsayed, Jimmy J. Lin, Douglas W. Oard

ACL 2008 | Computational Linguistics | Document Similarity | Large Document Collections | Pairwise Document Similarity

Post Info

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2008
Where	ACL
Authors	Tamer Elsayed, Jimmy J. Lin, Douglas W. Oard
