This paper gives an overview of the evaluation method used for the Web Retrieval Task in the Third NTCIR Workshop, which is currently in progress. In the Web Retrieval Task, we assess the retrieval effectiveness of each Web search engine system using a common data set, and attempt to build a reusable test collection suitable for evaluating Web search engine systems. To these ends, we have built 100-gigabyte and 10-gigabyte document sets, gathered mainly from the `.jp' domain. Relevance judgment is performed on the retrieved documents, which are written in Japanese or English.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval

General Terms
Experimentation, Human Factors, Performance, Reliability

Keywords
evaluation method, test collection, Web information retrieval