The deluge of huge datasets, such as those produced by
sensor networks, online transactions, and the web, provides
exciting opportunities for data analysis. The scale of the
data makes it impossible to process in a reasonable amount
of time on isolated machines. This has led to dataflow systems emerging as the standard tool for solving research problems using these vast datasets. In typical dataflow systems,
runtimes like Dryad [3] and Streamline [1] define graphs of
processes, the edges of the graphs representing pipes, and
their vertices representing computation. Within these runtimes, a new class of languages such as Sawzall [6] lets researchers solve “pleasantly parallel” problems (problems in which each element of a dataset is considered independent of every other element) more quickly
without worrying about explicit concurrency.
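
As a rough illustration of what "pleasantly parallel" means in practice, the sketch below applies a per-record function to every element of a small dataset, with a thread pool standing in for the runtime. It is written in Python rather than Sawzall, and the names `word_count` and `parallel_map` are hypothetical, chosen only for this example; it is not how Dryad, Streamline, or Sawzall are implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(record: str) -> int:
    # The result depends only on this record, never on any other element,
    # which is what makes the problem pleasantly parallel.
    return len(record.split())


def parallel_map(func, records, workers=4):
    # The pool stands in for the runtime: it decides how records are
    # scheduled, so the caller writes no locks or explicit threads.
    # Executor.map returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, records))


records = ["a b c", "d e", "f"]
counts = parallel_map(word_count, records)  # one independent result per record
total = sum(counts)                         # a simple aggregation step
```

Because no record depends on another, the records could be split across any number of workers or machines without changing the result; only the final aggregation has to see all of the per-record outputs.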
These languages provide automated control flow (typically
matched to the architecture of the underlying runtim...