Taking advantage of the well-known cluster hypothesis that “closely associated documents tend to be relevant to the same request”, we can use inter-document similarity to provide more accurate and robust evaluation of retrieval systems. Using our method, we are able to rank retrieval systems accurately with up to 99% fewer relevance judgments than were collected for the TREC conferences, and significantly more accurately than other algorithms given the same number of judgments.

Categories and Subject Descriptors: H.3 Information Storage and Retrieval; H.3.4 Systems and Software: Performance Evaluation

General Terms: Experimentation, Measurement