Word Length n-Grams for Text Re-use Detection

Abstract. The automatic detection of shared content in written documents (which includes text reuse and its unacknowledged commitment, plagiarism) has become an important problem in Information Retrieval. This task requires exhaustive comparison of texts in order to determine how similar they are. However, such comparison is infeasible when the number of documents is too large. Therefore, we have designed a model for the proper pre-selection of closely related documents in order to perform the exhaustive comparison afterwards. We use a similarity measure based on word-level n-grams, which proved to be quite effective in many applications. As this approach normally becomes impracticable for real-world large datasets, we propose a method based on a preliminary word-length encoding of texts, substituting a word by its length, providing three important advantages: (i) since the alphabet of the documents is reduced to nine symbols, the space needed to store n-gram lists is r...
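The word-length encoding mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Three choices here are assumptions made only for illustration: capping word lengths at 9 (one plausible reading of the "nine symbols" alphabet), tokenizing with a simple `\w+` pattern, and using the Jaccard coefficient over n-gram sets as the similarity measure. All function names are hypothetical.

```python
import re

# Assumption: lengths above 9 are capped, so the encoded alphabet is {1, ..., 9}.
MAX_LEN = 9


def encode_word_lengths(text):
    """Replace each word by its length, capped at MAX_LEN."""
    words = re.findall(r"\w+", text)  # assumption: a word is a run of \w characters
    return [min(len(w), MAX_LEN) for w in words]


def ngrams(symbols, n):
    """Return the set of contiguous n-grams in a sequence of symbols."""
    return {tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def length_ngram_similarity(doc_a, doc_b, n=3):
    """Compare two texts through the n-grams of their word-length encodings."""
    grams_a = ngrams(encode_word_lengths(doc_a), n)
    grams_b = ngrams(encode_word_lengths(doc_b), n)
    return jaccard(grams_a, grams_b)
```

Under these assumptions, `encode_word_lengths("The automatic detection of shared content")` yields `[3, 9, 9, 2, 6, 7]`. Because the encoding discards the words themselves, short n-grams collide often between unrelated texts, so a larger `n` is likely needed than with ordinary word n-grams.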

Alberto Barrón-Cedeño, Chiara Basile

CICLING 2010 | Documents | Exhaustive Comparison | Exhaustive Comparison Afterwards | Natural Language Processing

» Text-image alignment for historical handwritten documents

» An Efficient Word Segmentation Technique for Historical and Degraded Machine-Printed Docume...

» The Automatic Extraction of Open Compounds from Text Corpora

» Detecting False Matches in String Matching Algorithms

» Addressing Concept-Evolution in Concept-Drifting Data Streams

» Hidden Pattern Statistics

» Bit-Parallel Witnesses and Their Applications to Approximate String Matching

Post Info

Added	12 Aug 2010
Updated	12 Aug 2010
Type	Conference
Year	2010
Where	CICLING
Authors	Alberto Barrón-Cedeño, Chiara Basile, Mirko Degli Esposti, Paolo Rosso
