Sciweavers

52 search results - page 1 / 11
» Finding near-duplicate web pages: a large-scale evaluation o...
WWW
2008
ACM
14 years 8 months ago
Efficient similarity joins for near duplicate detection
With the increasing amount of data and the need to integrate data from multiple data sources, a challenging issue is to find near duplicate records efficiently. In this paper, we ...
Chuan Xiao, Wei Wang 0011, Xuemin Lin, Jeffrey Xu ...
CPM
2000
Springer
13 years 11 months ago
Identifying and Filtering Near-Duplicate Documents
The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed-size “sketch...
Andrei Z. Broder
WWW
2007
ACM
14 years 8 months ago
P-TAG: large scale automatic generation of personalized annotation tags for the web
The success of the Semantic Web depends on the availability of Web pages annotated with metadata. Free form metadata or tags, as used in social bookmarking and folksonomies, have ...
Paul-Alexandru Chirita, Stefania Costache, Wolfgan...
SIGIR
2006
ACM
14 years 1 months ago
Finding near-duplicate web pages: a large-scale evaluation of algorithms
Broder et al.’s [3] shingling algorithm and Charikar’s [4] random projection based approach are considered “state-of-the-art” algorithms for finding near-duplicate web pag...
Monika Rauch Henzinger
WEBI
2004
Springer
14 years 24 days ago
Finding Related Pages Using the Link Structure of the WWW
Most of the current algorithms for finding related pages are exclusively based on text corpora of the WWW or incorporate only authority or hub values of pages. In this paper, we ...
Paul-Alexandru Chirita, Daniel Olmedilla, Wolfgang...
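
Several of the abstracts above refer to shingling and to estimating document resemblance from a fixed-size sketch. As a rough illustration only, the following Python sketch treats resemblance as the Jaccard coefficient of two sets of word shingles and estimates it from a min-hash sketch. The function names, the 3-word shingle length, the sketch size of 64, and the use of seeded SHA-1 as a stand-in for random permutations are all assumptions made for this example; they are not taken from any of the listed papers.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b):
    """Exact resemblance: Jaccard coefficient of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(shingle, seed):
    # Seeded SHA-1 stands in for one random permutation (an assumption).
    data = f"{seed}:{' '.join(shingle)}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def sketch(shingle_set, size=64):
    """Fixed-size sketch: the minimum hash value under each seeded hash."""
    if not shingle_set:
        raise ValueError("cannot sketch an empty shingle set")
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(size)]


def estimate_resemblance(sketch_a, sketch_b):
    """Fraction of sketch positions where the two minima agree."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    a = shingles("the quick brown fox jumps over the lazy dog")
    b = shingles("the quick brown fox jumps over the lazy cat")
    print("exact resemblance:", resemblance(a, b))
    print("estimated resemblance:", estimate_resemblance(sketch(a), sketch(b)))
```

The estimate approaches the exact value as the sketch size grows; with only 64 positions it can differ noticeably from the exact figure on short texts.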