Familiar evaluation methodologies for information retrieval (IR) are not well suited to the task of comparing systems in many real settings. These systems and evaluation methods must support contextual, interactive retrieval over changing, heterogeneous data collections, including private and confidential information. We have implemented a comparison tool which can be inserted into the natural IR process. It provides a familiar search interface, presents a small number of result sets in side-by-side panels, elicits searcher judgments, and logs interaction events. The tool permits study of real information needs as they occur, uses the documents actually available at the time of the search, and records judgments taking into account the instantaneous needs of the searcher. We have validated our proposed evaluation approach and explored potential biases by comparing different whole-ofWeb search facilities using a Web-based version of the tool. In four experiments, one with supplied queri...