We consider using machine learning techniques to help understand a large software system. In particular, we describe how learning techniques can be used to reconstruct abstract Datalog specifications of a certain type of database software from examples of its operation. In a case study involving a large (more than one million lines of C) real-world software system, we demonstrate that off-the-shelf inductive logic programming methods can be successfully used for specification recovery; specifically, Grendel2 can extract specifications for about one-third of the modules in a test suite with high rates of precision and recall. We then describe two extensions to Grendel2 which improve performance on this task: one which allows it to output a set of candidate hypotheses, and another which allows it to output specifications containing determinations. In combination, these extensions enable specifications to be extracted for nearly two-thirds of the benchmark modules with perfect recall, an...
William W. Cohen