Electronic mail poses a number of unusual challenges for the design of information retrieval systems and test collections, including informal expression, conversational structure, variable document granularity (e.g., messages, threads, or longer-term interactions), a naturally occuring integration between free text and structural metadata, and incompletely characterized user needs. This paper reports on initial experiments with a large collection of public mailing lists from the World Wide Web consortium that will be used for the TREC 2005 Enterprise Search Track. Automatic subject-line threading and removal of duplicated text were found to have little effect in a small pilot study. Those observations motivated development of a question typology and more detailed analysis of collection characteristics; preliminary results for both are reported. Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval General Terms Performance, Exp...
Yejun Wu, Douglas W. Oard