Data mining is increasingly performed by people who are not computer scientists or professional programmers. It is often done as an iterative process involving multiple ad-hoc tasks, as well as data pre- and post-processing, all of which must be executed over large databases. In order to make data mining more accessible, it is critical to provide a simple, easy-to-use language that allows the user to specify adhoc data processing, model construction, and model manipulation. Simultaneously, it is necessary for the underlying system to scale up to large datasets. Unfortunately, while each of these requirements can be satisfied, individually, by existing systems, no system fully satisfies all criteria. In this paper, we present a system called Splash to fill this void. Splash supports an extended relational data model and SQL query language, which allows for the natural integration of statistical modeling and ad-hoc data processing. It also supports a novel representatives operator to he...