Web archives play an important role in preserving our cultural heritage for future generations. When searching them, a serious problem arises from the fact that terminology evolves constantly. Since today’s users formulate queries using current terminology, old but relevant documents are often not retrieved. The query saint petersburg museum, for instance, does not retrieve documents from the 1970s about museums in Leningrad (the former name of Saint Petersburg). We address this problem by determining query reformulations that paraphrase the user’s information need using terminology prevalent in the past. We propose a measure of across-time semantic similarity that assesses the degree of relatedness between two terms when used at different times. Using this measure as a crucial building block, we develop a novel query reformulation technique based on a hidden Markov model (HMM). Experiments on twenty years’ worth of New York Times articles demonstrate the usefulness and efficienc...
Klaus Berberich, Srikanta J. Bedathur, Mauro Sozio