Text reuse occurs in many different types of documents and for many different reasons. One form of reuse, duplicate or near-duplicate documents, has been a focus of researchers because of its importance in Web search. Local text reuse occurs when sentences, facts or passages, rather than whole documents, are reused and modified. Detecting this type of reuse can be the basis of new tools for text analysis. In this paper, we introduce a new approach to detecting local text reuse and compare it to other approaches. This comparison involves a study of the amount and type of reuse that occurs in real documents, including TREC newswire and blog collections.

Categories and Subject Descriptors: H.3.1 [Content Analysis and Indexing]: Indexing methods

General Terms: Algorithms, Measurement, Experimentation

Keywords: Text reuse, fingerprinting, information flow
Jangwon Seo, W. Bruce Croft