On the Weakenesses of Correlation

14 years 10 months ago


The correlation of the result lists provided by search engines is a fundamental question with deep, multidisciplinary ramifications. Here, we present automatic and unsupervised methods to assess whether search engines provide results that are comparable or correlated. We have two main contributions. First, we provide evidence that for more than 80% of the input queries, independently of their frequency, the two major search engines share only three or fewer URLs in their search results, leading to an increasing divergence. In this scenario (divergence), we show that even the most robust measures based on comparing lists are useless to apply; that is, the small contribution of too few common items yields no confidence. Second, to overcome this problem, we propose the first content-based measures, i.e., direct comparison of the contents from search results; these measures are based on the Jaccard ratio and distribution similarity measures (CDF measures). We show that th...
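To make the quantities in the abstract concrete, the following is a minimal Python sketch and not the authors' implementation. It computes the URL overlap between two result lists, the Jaccard ratio on URLs and on content tokens, and a simple largest-gap distance between two empirical CDFs. The function names (`jaccard`, `tokens`, `cdf_gap`), the tokenization rule, the choice of CDF distance, and all sample data are assumptions made for illustration; the abstract does not specify which CDF measures are used or how content is preprocessed.

```python
import re


def jaccard(a, b):
    """Jaccard ratio |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def tokens(text):
    """Hypothetical tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cdf_gap(xs, ys):
    """Largest absolute gap between the empirical CDFs of two samples.

    This is one possible distribution similarity measure, assumed here
    for illustration only.
    """
    gap = 0.0
    for p in sorted(set(xs) | set(ys)):
        fx = sum(1 for x in xs if x <= p) / len(xs)
        fy = sum(1 for y in ys if y <= p) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap


# Made-up result lists for one query (example.com placeholders).
engine_a = ["example.com/1", "example.com/2", "example.com/3",
            "example.com/4", "example.com/5"]
engine_b = ["example.com/2", "example.com/5", "example.com/6",
            "example.com/7", "example.com/8"]

shared_urls = set(engine_a) & set(engine_b)
url_jaccard = jaccard(engine_a, engine_b)

# Content-based comparison on made-up snippets instead of URLs.
content_jaccard = jaccard(tokens("Cheap flights to Rome"),
                          tokens("Cheap hotels in Rome"))

print(len(shared_urls), url_jaccard, content_jaccard)
```

With only two shared URLs out of eight distinct ones, a list-based comparison has very little to work with, which is the divergence scenario the abstract describes. The content-based Jaccard ratio can still be computed in that case, because it compares what the results say rather than where they point.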

Paolo D'Alberto, Ali Dasdan
