Web Search Results Clustering in Polish: Experimental Evaluation of Carrot

15 years 8 months ago

Download www.cs.put.poznan.pl

Abstract. In this paper we consider the problem of web search results clustering in the Polish language, supporting our analysis with results acquired from an experimental system named Carrot. The algorithm under consideration, Suffix Tree Clustering (STC), has been acknowledged as very efficient when applied to English. We present conclusions from its experimental application to Polish, indicating fragile areas where the algorithm seems to fail due to specific properties of the input data. We show that the characteristics of the produced clusters (number, value), unlike in English, strongly depend on the pre-processing phase. We also investigate the influence of two primary STC parameters, merge threshold and minimum base cluster score, on the number and quality of results. Finally, we introduce two approaches to efficient, approximate stemming of Polish words: a quasi-stemmer and an automaton-based method.

1 Search results clustering overview

Together with an exponentia...
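To make the two STC parameters named in the abstract more concrete, here is a minimal sketch of the base cluster merging step as STC is commonly described (Zamir and Etzioni): base clusters scoring below a minimum are dropped, and two base clusters are linked when their shared documents exceed the merge threshold as a fraction of each cluster. This is not Carrot's code. The function name `merge_base_clusters`, the input layout, the default values, and the precomputed scores are all assumptions made for illustration.

```python
from itertools import combinations


def merge_base_clusters(base_clusters, merge_threshold=0.5, min_score=2.0):
    """Hypothetical helper: merge STC base clusters into final clusters.

    base_clusters: list of (phrase, set_of_document_ids, score) tuples.
    The scores are assumed to be computed elsewhere.
    """
    # Drop base clusters whose score is below the minimum base cluster score.
    kept = [bc for bc in base_clusters if bc[2] >= min_score]

    # Union-find structure used to collect connected components.
    parent = list(range(len(kept)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link two base clusters when the overlap is large relative to both.
    for i, j in combinations(range(len(kept)), 2):
        docs_i, docs_j = kept[i][1], kept[j][1]
        overlap = len(docs_i & docs_j)
        if (overlap > merge_threshold * len(docs_i)
                and overlap > merge_threshold * len(docs_j)):
            parent[find(i)] = find(j)

    # Each connected component becomes one final cluster.
    groups = {}
    for i, (phrase, docs, _score) in enumerate(kept):
        phrases, all_docs = groups.setdefault(find(i), ([], set()))
        phrases.append(phrase)
        all_docs.update(docs)
    return [(sorted(phrases), docs) for phrases, docs in groups.values()]


# Made-up example data, not taken from the paper.
example = [
    ("suffix tree", {1, 2, 3}, 6.0),
    ("tree clustering", {2, 3, 4}, 6.0),
    ("polish stemming", {7, 8}, 4.0),
    ("web", {1, 9}, 1.0),  # below min_score, so it is discarded
]
clusters = merge_base_clusters(example)
```

In this sketch, raising `merge_threshold` yields more, smaller clusters, while raising `min_score` discards weaker phrases before merging. These are the two parameters whose influence the abstract says the paper investigates.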

Dawid Weiss, Jerzy Stefanowski
