Intrinsic Plagiarism Detection Using Character Trigram Distance Scores - Notebook for PAN at CLEF 2011



Abstract: In this paper, we describe a novel approach to intrinsic plagiarism detection. Each suspicious document is divided into a series of consecutive, potentially overlapping ‘windows’ of equal size. Each window is represented by a vector containing the relative frequencies of a predetermined set of high-frequency character trigrams. Subsequently, a distance matrix is set up in which each of the document’s windows is compared to every other window. The distance measure used is a symmetric adaptation of the normalized distance (nd1) proposed by Stamatatos [17]. Finally, an algorithm for outlier detection in multivariate data (based on Principal Components Analysis) is applied to the distance matrix in order to detect plagiarized sections. In the PAN-PC-2011 competition, this system placed second: it achieved a competitive recall (.4279) but reached a plagdet score of only .1679 because of disappointing precision (.1075).
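
The first three steps of the abstract (windowing, trigram frequency vectors, pairwise distance matrix) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the window size, the step, and the vocabulary size are all hypothetical, and the distance is an assumed symmetric form in the spirit of nd1 (a normalized squared relative difference), since the abstract does not give the exact formula. The PCA-based outlier detection step is omitted.

```python
from collections import Counter


def windows(text, size, step):
    """Equal-size character windows; step < size makes them overlap.
    A trailing remainder shorter than `size` is dropped."""
    if len(text) <= size:
        return [text]
    return [text[i:i + size] for i in range(0, len(text) - size + 1, step)]


def trigram_counts(text):
    """Count every character trigram in `text`."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def profile(window, vocab):
    """Relative frequency of each vocabulary trigram within one window."""
    counts = trigram_counts(window)
    total = sum(counts.values()) or 1  # avoid division by zero on tiny windows
    return [counts[g] / total for g in vocab]


def symmetric_distance(p, q):
    """ASSUMED symmetric, nd1-like distance in [0, 1]; not the paper's formula.
    Trigrams absent from both windows are skipped."""
    terms = [(2 * (a - b) / (a + b)) ** 2 for a, b in zip(p, q) if a + b > 0]
    return sum(terms) / (4 * len(terms)) if terms else 0.0


def distance_matrix(text, size=1000, step=500, vocab_size=100):
    """Compare every window with every other window.
    The vocabulary is the document's own most frequent trigrams (an assumption)."""
    vocab = [g for g, _ in trigram_counts(text).most_common(vocab_size)]
    profiles = [profile(w, vocab) for w in windows(text, size, step)]
    return [[symmetric_distance(p, q) for q in profiles] for p in profiles]
```

Under these assumptions, a window whose row of distances is consistently large compared with the other rows is a candidate for a stylistically deviant passage; the paper performs that final step with a PCA-based outlier detector rather than a simple threshold.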

Mike Kestemont, Kim Luyckx, Walter Daelemans


CLEF 2011 | Distance Measure | Information Technology | Plagiarism Detection | Principal Components Analysis


Post Info

Added	18 Dec 2011
Updated	18 Dec 2011
Type	Journal
Year	2011
Where	CLEF
Authors	Mike Kestemont, Kim Luyckx, Walter Daelemans
