Sciweavers

SAC
2015
ACM

Multi-component similarity method for web product duplicate detection

Due to the growing number of Web shops, aggregating product data from the Web is growing in importance. One of the problems encountered in product aggregation is duplicate detection. In this paper, we extend and significantly improve an existing state-of-the-art product duplicate detection method. Our approach employs a novel method for combining the titles’ and the attributes’ similarities into a final product similarity. We use q-grams to handle partial matching of words, such as abbreviations. Where existing methods cluster products of only two Web shops, we propose a hierarchical clustering method to handle multiple Web shops. Applying our new method to a dataset of TVs from four Web shops reveals that it significantly outperforms the Hybrid Similarity Method, the Title Model Words Method, and the well-known TF-IDF method, with an F1 score of 0.475 compared to 0.287, 0.298, and 0.335, respectively.
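The abstract mentions q-grams for partial word matching but gives no formula. The following is a minimal sketch of one common q-gram similarity (a Dice-style overlap of padded character trigrams), not necessarily the measure used in the paper. The function names, the padding character `#`, and the default `q = 3` are all assumptions made for illustration.

```python
from collections import Counter


def qgrams(text: str, q: int = 3) -> Counter:
    """Return the multiset of overlapping q-grams of a lowercased, padded string.

    Padding with '#' (an assumed convention) lets word boundaries contribute
    q-grams, so short strings and abbreviations still produce matches.
    """
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def qgram_similarity(a: str, b: str, q: int = 3) -> float:
    """Dice-style overlap of the two q-gram multisets, in the range [0, 1]."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Only reachable for q = 1 with two empty strings.
        return 1.0
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    return 2.0 * shared / total


if __name__ == "__main__":
    # An abbreviation shares its leading trigrams with the full word.
    print(qgram_similarity("inch", "in"))
    print(qgram_similarity("samsung", "samsung"))
```

Under these assumptions, "inch" and "in" share two of their ten padded trigrams in total and score 0.4, whereas an exact token comparison would treat them as entirely different words.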
Added 17 Apr 2016
Updated 17 Apr 2016
Type Journal
Year 2015
Where SAC
Authors Ronald van Bezu, Sjoerd Borst, Rick Rijkse, Jim Verhagen, Damir Vandic, Flavius Frasincar