This paper introduces a framework for clarifying and formalizing the duplicate document detection problem. Four distinct models are presented, each with a corresponding algorithm ...
Skew detection via principal components is proposed as an effective method for images that contain content other than text. It is shown that the negative of the image leads to much mor...
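
As an illustration of the general idea, and not the paper's actual algorithm, a minimal sketch of principal-component skew estimation could look like the following. It assumes a grayscale page stored as a NumPy array with values in [0, 1] and dark content on a light background, so that taking the negative makes the content pixels the bright ones. The function name `estimate_skew_degrees` and the `threshold` parameter are hypothetical and chosen only for this example.

```python
import numpy as np


def estimate_skew_degrees(negative, threshold=0.5):
    """Estimate the dominant orientation of the bright pixels in `negative`.

    Hypothetical helper, not taken from the paper. `negative` is assumed to
    be the inverted page image, so content pixels are bright.
    Returns an angle in degrees in the range (-90, 90]. Image rows grow
    downward, so a positive angle is a clockwise tilt as displayed.
    """
    ys, xs = np.nonzero(negative > threshold)
    if xs.size < 2:
        raise ValueError("need at least two foreground pixels")

    # Treat each foreground pixel as a 2-D point (x, y) and centre the cloud.
    coords = np.column_stack((xs, ys)).astype(float)
    coords -= coords.mean(axis=0)

    # The first principal component is the eigenvector of the covariance
    # matrix that has the largest eigenvalue.
    cov = np.cov(coords, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    vx, vy = eigvecs[:, np.argmax(eigvals)]

    # An eigenvector's sign is arbitrary, so fold the angle into (-90, 90].
    angle = np.degrees(np.arctan2(vy, vx))
    if angle > 90.0:
        angle -= 180.0
    elif angle <= -90.0:
        angle += 180.0
    return angle


if __name__ == "__main__":
    # Synthetic page: white background with one dark line tilted by 10 degrees.
    page = np.ones((200, 200))
    slope = np.tan(np.radians(10.0))
    for x in range(20, 181):
        page[int(round(100 + slope * (x - 100))), x] = 0.0

    # Work on the negative so that the line pixels are the bright ones.
    print(estimate_skew_degrees(1.0 - page))
```

The sketch uses every bright pixel as a sample point, so the result depends on which pixels end up as foreground. On a real page that mixes text with other content, the thresholding and any preprocessing would need to follow the paper.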