We examine the problem of retrieving the top-m ranked items from a large collection, randomly distributed across an n-node system. In order to retrieve the top m overall, we must retrieve the top m from the subcollection stored on each node and merge the results. However, if we are willing to accept a small probability that one or more of the top-m items may be missed, it is possible to reduce computation time by retrieving only the top k < m from each node. In this paper, we demonstrate that this simple observation can be exploited in a realistic application to produce a substantial efficiency improvement without compromising the quality of the retrieved results. To support our claim, we present a statistical model that predicts the impact of the optimization. The paper is structured around a specific application — passage retrieval for question answering — but the primary results are more broadly applicable. Categories and Subject Descriptors H.3.4 [Information Systems]: Inf...
Charles L. A. Clarke, Egidio L. Terra