Using Fuzzy-Word Correlation Factors to Compute Document Similarity Based on Phrase Matching

15 years 1 month ago


One of today's Web information retrieval (IR) problems is identifying redundant information that exists in (replicated) Web documents. Such documents appear in several forms, such as different versions of the same document or small documents combined with others to form a larger document. As the Web grows more popular, the number of documents on it increases daily, and filtering redundant ones out of this huge collection becomes a more difficult and more urgent task. As one solution to this problem, we present a new method that identifies similar documents based on phrase matching, using the fuzzy-word correlation factors among words in phrases. Since phrases can be treated as sequences of words in a sentence in any document, we consider the correlation factors of different words in any two phrases of two different documents to determine the degree of similarity of the phrases, which in turn can determine the similarit...
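The idea of scoring phrases through word-to-word correlation factors can be illustrated with a minimal Python sketch. It is not the authors' formulation: the correlation table `CORRELATION`, its values, and the averaging scheme in `phrase_similarity` and `document_similarity` are all assumptions made here for illustration, since the abstract does not give the actual formulas.

```python
# Hypothetical word-to-word correlation factors in [0, 1].
# These toy values are assumptions, not taken from the paper.
CORRELATION = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("fast", "quick")): 0.8,
    frozenset(("car", "vehicle")): 0.7,
}


def word_correlation(w1, w2):
    """Return the correlation factor of two words (1.0 if identical)."""
    if w1 == w2:
        return 1.0
    return CORRELATION.get(frozenset((w1, w2)), 0.0)


def phrase_similarity(p1, p2):
    """Assumed aggregation: for each word of p1, take its best
    correlation with any word of p2, then average over p1."""
    if not p1 or not p2:
        return 0.0
    best = [max(word_correlation(a, b) for b in p2) for a in p1]
    return sum(best) / len(p1)


def document_similarity(d1, d2):
    """Assumed aggregation: for each phrase of d1, take its best
    match among the phrases of d2, then average over d1."""
    if not d1 or not d2:
        return 0.0
    best = [max(phrase_similarity(p, q) for q in d2) for p in d1]
    return sum(best) / len(d1)


# Documents are lists of phrases; phrases are lists of words.
doc_a = [["fast", "car"]]
doc_b = [["quick", "automobile"], ["red", "house"]]
score = document_similarity(doc_a, doc_b)
```

Under this scheme the score is asymmetric: `document_similarity(doc_a, doc_b)` measures how well the phrases of `doc_a` are covered by `doc_b`, which may differ from the reverse direction.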

Jun won Lee, Yiu-Kai Ng
