Lower-bounding term frequency normalization



In this paper, we reveal a common deficiency of current retrieval models: the component that normalizes term frequency (TF) by document length is not properly lower-bounded; as a result, very long documents tend to be overly penalized. To diagnose this problem analytically, we propose two desirable formal constraints that capture the heuristic of lower-bounding TF, and use constraint analysis to examine several representative retrieval functions. The analysis shows that each of these retrieval functions satisfies the constraints only for a certain range of parameter values and/or for a particular set of query terms. Empirical results further show that retrieval performance tends to be poor when the parameter is outside that range or the query term is not in that set. To solve this common problem, we propose a general and efficient method to introduce a sufficiently large lower bound for TF normalization, which can be shown analytically to fix or alleviate the...
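To make the deficiency concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the standard Okapi BM25 TF component with pivoted length normalization as the retrieval function, and it assumes that the lower bound is introduced by adding a constant `delta` to the TF component of every matched term; neither the formula nor the names `bm25_tf`, `lower_bounded_tf`, `k1`, `b`, and `delta` appear in the abstract above. The sketch shows that, without the shift, the contribution of a term that occurs once shrinks toward zero as the document grows, so a very long document that matches the term scores almost the same as one that does not contain it.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Standard BM25 TF component (assumed form, not taken from the abstract)."""
    if tf <= 0:
        return 0.0
    # Pivoted document-length normalization: grows linearly with doc_len.
    norm = 1.0 - b + b * doc_len / avg_doc_len
    return (k1 + 1.0) * tf / (k1 * norm + tf)


def lower_bounded_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75, delta=1.0):
    """BM25 TF component shifted by a constant for matched terms.

    `delta` is a hypothetical parameter for this sketch: it guarantees that a
    matched term contributes at least `delta`, however long the document is.
    """
    if tf <= 0:
        return 0.0  # unmatched terms still contribute nothing
    return bm25_tf(tf, doc_len, avg_doc_len, k1, b) + delta


if __name__ == "__main__":
    avg_doc_len = 100
    for doc_len in (100, 10_000, 1_000_000):
        plain = bm25_tf(1, doc_len, avg_doc_len)
        bounded = lower_bounded_tf(1, doc_len, avg_doc_len)
        print(f"doc_len={doc_len:>9}  plain={plain:.6f}  lower-bounded={bounded:.6f}")
```

In the plain variant the gap between "term present once" and "term absent" vanishes as `doc_len` grows, whereas in the shifted variant that gap never falls below `delta`. Which value of `delta` counts as sufficiently large is a question the truncated abstract does not settle; the default used here is only illustrative.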

Yuanhua Lv, ChengXiang Zhai
