A large number of question and answer pairs can be collected from question and answer boards and FAQ pages on the Web. This paper proposes an automatic method for finding questions that have the same meaning. The method can detect semantically similar questions that have little word overlap because it calculates question-question similarities using the corresponding answers as well as the questions. We develop two different similarity measures based on language modeling and compare them with traditional similarity measures. Experimental results show that semantically similar question pairs can be found effectively with the proposed similarity measures.

Categories and Subject Descriptors: H.3.0 [Information Search and Retrieval]: General

General Terms: Algorithms, Measurement, Experimentation

Keywords: Information Retrieval, FAQ retrieval, Language Models

Jiwoon Jeon, W. Bruce Croft, Joon Ho Lee