ACL
2008

Pairwise Document Similarity in Large Collections with MapReduce

This paper presents a MapReduce algorithm for computing pairwise document similarity in large document collections. MapReduce is an attractive framework because it allows us to decompose the inner products involved in computing document similarity into separate multiplication and summation stages in a way that is well matched to efficient disk access patterns across several machines. On a collection consisting of approximately 900,000 newswire articles, our algorithm exhibits linear growth in running time and space in terms of the number of documents.
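The decomposition described in the abstract can be illustrated with a small single-process sketch. It is not the authors' implementation and does not run on a MapReduce cluster; it only mimics the two stages in plain Python. The function names (`build_postings`, `map_products`, `reduce_sums`, `pairwise_similarity`) are hypothetical, and the use of raw term weights supplied as dictionaries is an assumption made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(docs):
    """Indexing step: map each term to its list of (doc_id, weight) postings.

    `docs` is assumed to be {doc_id: {term: weight}}.
    """
    postings = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            postings[term].append((doc_id, weight))
    return postings


def map_products(postings):
    """Multiplication stage: for every term, emit one partial product
    for each pair of documents that share that term."""
    for plist in postings.values():
        # Sorting keeps each pair key in a canonical (smaller_id, larger_id) order.
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            yield (d1, d2), w1 * w2


def reduce_sums(partial_products):
    """Summation stage: add up the partial products for each document pair,
    which yields the inner product of the two documents' weight vectors."""
    sims = defaultdict(float)
    for pair, product in partial_products:
        sims[pair] += product
    return dict(sims)


def pairwise_similarity(docs):
    """Chain the stages: postings -> per-term products -> per-pair sums."""
    return reduce_sums(map_products(build_postings(docs)))
```

In this sketch, only document pairs that share at least one term ever produce output, so pairs with zero similarity are never materialized. For example, calling `pairwise_similarity` on three toy documents `d1 = {a: 2, b: 1}`, `d2 = {a: 1, c: 3}`, `d3 = {b: 4, c: 1}` gives 2.0 for (d1, d2), 4.0 for (d1, d3), and 3.0 for (d2, d3).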
Tamer Elsayed, Jimmy J. Lin, Douglas W. Oard
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2008
Where ACL