Students are often asked to submit electronic copies of their program code as part of assessment in computer science courses. To counter code plagiarism, educational institutions use tools to detect similarity between submissions. Previous research has shown that using a modified text search engine to identify similar code within large code collections is both efficient and effective. The similarity functions used internally by such search engines have historically been devised manually by experts in the field; in this work, we investigate the practicality of using evolutionary computing techniques to evolve similarity functions. We use particle swarm optimisation to find optimal values of the variables in human-constructed similarity functions, and we use genetic programming to generate new similarity functions specifically for this task. We show empirically that our optimised similarity functions perform better than standard Okapi BM25 across a range of collections. Our results ...
Victor Ciesielski, Nelson Wu, Seyed M. M. Tahaghog