Efficiency of Data Structures for Detecting Overlaps in Digital Documents

Download www.csse.monash.edu.au

This paper analyses the efficiency of different data structures for detecting overlap in digital documents. Most existing approaches use a hash function to reduce the space requirements of their chunk indices. Since a hash function can produce the same value for different chunks, false matches are possible. We propose an algorithm that eliminates those false matches. The algorithm uses a suffix tree, which is space-consuming. We define a modified suffix tree that considers only chunks starting at the beginning of words, and we show how the algorithm works on this structure. Alternatively, the space requirements of a suffix tree can be reduced by converting it to a directed acyclic graph. We show that suffix link information can be preserved in this new structure and that the matching statistics algorithm still works with the modifications we propose.
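The false-match problem described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the paper verifies candidates with a suffix tree, whereas this sketch swaps in a plain exact string comparison to keep the example short. The names `word_chunks`, `toy_hash`, `build_index`, and `candidates_and_verified` are hypothetical, the chunking rule (k consecutive words, one chunk per word start) is an assumption loosely modelled on the word-boundary restriction, and the hash is deliberately weak so that collisions are easy to see.

```python
def word_chunks(text, k):
    """Return every run of k consecutive words, one chunk per word start."""
    words = text.lower().split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def toy_hash(chunk, buckets=8):
    """Deliberately tiny hash space (an assumption) so collisions are likely."""
    return sum(chunk.encode("utf-8")) % buckets


def build_index(text, k, buckets=8):
    """Map each hash value to the set of source chunks that produced it.

    A space-saving index would store only the hash values; the chunks are
    kept here so that candidates can be verified exactly afterwards.
    """
    index = {}
    for chunk in word_chunks(text, k):
        index.setdefault(toy_hash(chunk, buckets), set()).add(chunk)
    return index


def candidates_and_verified(index, suspicious, k, buckets=8):
    """Return (hash-level candidates, exactly verified matches).

    Candidates are chunks whose hash value occurs in the index; they may
    include false matches. Verified matches are the candidates whose text
    really occurs in the source.
    """
    candidates, verified = [], []
    for chunk in word_chunks(suspicious, k):
        bucket = index.get(toy_hash(chunk, buckets))
        if bucket is not None:
            candidates.append(chunk)
            if chunk in bucket:  # exact comparison removes false matches
                verified.append(chunk)
    return candidates, verified


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog"
    suspicious = "a quick brown fox sat on the lazy dog"
    index = build_index(source, k=3)
    candidates, verified = candidates_and_verified(index, suspicious, k=3)
    print("hash-level candidates:", candidates)
    print("verified overlaps:   ", verified)
```

Any chunk that appears among the candidates but not among the verified overlaps is a false match caused by a hash collision. Shrinking `buckets` makes such collisions more frequent, which is the trade-off between index size and accuracy that motivates an exact verification step.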

Krisztián Monostori, Arkady B. Zaslavsky, Heinz W. Schmidt

ACSC 2001 | False Matches | Hash Function | Suffix Tree | Theoretical Computer Science |

» Identifying table boundaries in digital documents via sparse line detection

» Pedigree Tracking in the Face of Ancillary Content

» OntoMiner: bootstrapping ontologies from overlapping domain specific web sites

» Nondestructive Integration of Form-Based Views

» Spatiotemporal Annotation Graph (STAG): A Data Model for Composite Digital Objects

» Robust Change-Detection by Normalised Gradient-Correlation

» Winnowing: Local Algorithms for Document Fingerprinting

» External Plagiarism Detection Based on Standard IR Technology and Fast Recognition of Comm...

Post Info

Added	23 Aug 2010
Updated	23 Aug 2010
Type	Conference
Year	2001
Where	ACSC
Authors	Krisztián Monostori, Arkady B. Zaslavsky, Heinz W. Schmidt
