Mining Internet-Scale Software Repositories

Large repositories of source code create new challenges and opportunities for statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, and database storage of open source software. Sourcerer allows us to gather Internet-scale source code. For instance, in one experiment, we gather 4,632 Java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, SLOC, and lexical containment distributions. We then develop and apply unsupervised, probabilistic author-topic models to automatically discover the topics embedded in the code and extract topic-word and author-topic distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing deve...
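As an illustration of what "power-law behavior" in a quantity such as SLOC means in practice, the following is a minimal sketch, not the procedure used in the paper (the abstract does not say how the distributions were fitted). It draws synthetic values from a continuous power law and recovers the exponent with the standard maximum-likelihood estimator. The data are synthetic, and the names `sample_power_law`, `estimate_alpha`, and `synthetic_sloc` and the chosen parameters are assumptions made purely for the example.

```python
import math
import random


def sample_power_law(alpha, x_min, n, rng):
    """Draw n samples from a continuous power law p(x) ~ x^(-alpha), x >= x_min."""
    # Inverse-transform sampling: x = x_min * (1 - u)^(-1 / (alpha - 1)).
    # rng.random() lies in [0, 1), so (1 - u) lies in (0, 1] and is never zero.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def estimate_alpha(values, x_min):
    """Maximum-likelihood estimate of the exponent, using values >= x_min only."""
    tail = [v for v in values if v >= x_min]
    if not tail:
        raise ValueError("no values at or above x_min")
    log_sum = sum(math.log(v / x_min) for v in tail)
    if log_sum == 0.0:
        raise ValueError("all values equal x_min; exponent is undefined")
    return 1.0 + len(tail) / log_sum


# Synthetic stand-in for per-project SLOC counts (hypothetical, not real data).
rng = random.Random(0)
synthetic_sloc = sample_power_law(alpha=2.5, x_min=10.0, n=50_000, rng=rng)

alpha_hat = estimate_alpha(synthetic_sloc, x_min=10.0)
print(f"estimated exponent: {alpha_hat:.2f}")
```

With real repository data, the lower cutoff `x_min` would itself have to be chosen or estimated, and a goodness-of-fit check would be needed before calling a distribution a power law.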

Erik Linstead, Paul Rigor, Sushil Krishna Bajracharya, Cristina Videira Lopes, Pierre Baldi

Data first Reveal | Information Technology | Lexical Containment Distributions | NIPS 2007 | Source Code

» An experience report on scaling tools for mining software repositories using MapReduce

» Construction of Ontology-Based Software Repositories by Text Mining

» Quality Classifiers for Open Source Software Repositories

» A study of the contributors of PostgreSQL

» Mining Software Repositories for Software Change Impact Analysis: A Case Study

» On mining data across software repositories

» Using software evolution history to facilitate development and maintenance

» Repository software evaluation using the audit checklist for certification of trusted digi...

Post Info

Added	30 Oct 2010
Updated	30 Oct 2010
Type	Conference
Year	2007
Where	NIPS
Authors	Erik Linstead, Paul Rigor, Sushil Krishna Bajracharya, Cristina Videira Lopes, Pierre Baldi

Sciweavers
