The growth of open source projects has also made software plagiarism more convenient. A less scrupulous company may be tempted to plagiarize open source code for its own products. Although current plagiarism detection tools appear sufficient for academic use, they fall short against serious plagiarists: disguises such as statement reordering and code insertion can effectively confuse them. In this paper, we develop a new plagiarism detection tool, called GPlag, which detects plagiarism by mining program dependence graphs (PDGs). A PDG is a graph representation of the data and control dependencies within a procedure. Because PDGs are nearly invariant during plagiarism, GPlag is more effective than state-of-the-art tools for plagiarism detection. To make GPlag scalable to large programs, a statistical lossy filter is proposed to prune the plagiarism search space. An experimental study shows that GPlag is both ef...
Chao Liu, Chen Chen, Jiawei Han, Philip S. Yu
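
To illustrate why a dependence graph can survive statement reordering, the following is a minimal sketch, not GPlag's implementation. It assumes straight-line Python assignments, models data dependencies only (no control dependencies), and compares graphs by brute-force isomorphism, which is practical only for a handful of statements. The function names `data_dependence_graph` and `isomorphic` are hypothetical, introduced here for illustration.

```python
import ast
from itertools import permutations


def data_dependence_graph(source):
    """Build a toy data-dependence graph for straight-line code.

    One node per top-level statement, labelled by its AST type. An edge
    (i, j) means statement j reads a variable last assigned by statement i.
    Assumption: plain assignments only (no augmented assignment, no branches).
    """
    statements = ast.parse(source).body
    last_writer = {}  # variable name -> index of the statement that last assigned it
    labels, edges = [], set()
    for j, stmt in enumerate(statements):
        labels.append(type(stmt).__name__)
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Record reads first, so "x = x + 1" depends on the previous writer of x.
        for n in names:
            if isinstance(n.ctx, ast.Load) and n.id in last_writer:
                edges.add((last_writer[n.id], j))
        for n in names:
            if isinstance(n.ctx, ast.Store):
                last_writer[n.id] = j
    return labels, edges


def isomorphic(g, h):
    """Brute-force labelled graph isomorphism; only suitable for tiny graphs."""
    (g_labels, g_edges), (h_labels, h_edges) = g, h
    if len(g_labels) != len(h_labels) or len(g_edges) != len(h_edges):
        return False
    n = len(g_labels)
    for perm in permutations(range(n)):
        labels_match = all(g_labels[i] == h_labels[perm[i]] for i in range(n))
        if labels_match and {(perm[a], perm[b]) for a, b in g_edges} == h_edges:
            return True
    return False


original = "a = 1\nb = 2\nc = a + 1\nd = b * 2\ne = c + d\n"
# Same statements, reordered without breaking any dependency.
reordered = "b = 2\nd = b * 2\na = 1\nc = a + 1\ne = c + d\n"
# One operand changed, so the dependency structure differs.
changed = "a = 1\nb = 2\nc = a + 1\nd = a * 2\ne = c + d\n"

g = data_dependence_graph(original)
print(isomorphic(g, data_dependence_graph(reordered)))  # True
print(isomorphic(g, data_dependence_graph(changed)))    # False
```

The reordered snippet yields different statement indices but the same graph shape, whereas a comparison of the statement sequences would see two different programs. Handling code insertion would require subgraph rather than whole-graph matching, which this sketch does not attempt.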