Over 200 CVS repositories representing the assignments of students in a second-year undergraduate computer science course have been assembled. This unique data set represents many individuals working separately on identical projects, presenting the opportunity to evaluate the effect on performance of the work habits captured by CVS. This paper outlines our experiences mining and analyzing these repositories. We extracted various quantitative measures of student behaviour and code quality, and attempted to correlate these features with grades. Despite examining 166 features, we found that grades could not be accurately predicted; in particular, no predictor stronger than a simple lines-of-code count was found.
Keir Mierle, Kevin Laven, Sam T. Roweis, Greg Wils