level of abstraction, we can represent a workflow as a directed graph with operators (or tasks) at the vertices (see Figure 1). Each operator takes inputs from data sources or from the outputs of predecessor operators in the graph, performs some computation, and produces either outputs that feed other operators or the final, desired outputs. Large industrial workflows take tens to millions of files as inputs and routinely process terabyte-sized data sets. Operators are often domain- and application-specific. Logically, a workflow-management system can be implemented as a set of services that hide its implementation details. That is, the services can hide whether the operators execute on a single machine or are distributed across several machines, how big the operands are, and how long the processing takes (as long as it’s fast enough). Users can implement small workflows on a single machine using home-brew technology, but curious minds want to know how to deal with large workflows. Although we can...
Craig W. Thompson, Wing Ning Li, Zhichun Xiao
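
The graph model described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' system: the function `run_workflow`, the operator names (`clean`, `total`, `report`), and the sample data are all hypothetical. It assumes Python 3.9 or later for the standard-library `graphlib` module, and it runs every operator on a single machine, in an order where each operator's predecessors run first.

```python
from graphlib import TopologicalSorter


def run_workflow(operators, dependencies, sources):
    """Run a workflow expressed as a directed acyclic graph.

    operators:    maps an operator name to a callable (hypothetical names)
    dependencies: maps an operator name to the ordered list of its
                  predecessors (data sources or other operators)
    sources:      maps a data-source name to its value
    """
    results = dict(sources)  # data sources are already "computed"
    # static_order() yields each vertex only after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        if name in results:
            continue  # a data source, nothing to execute
        inputs = [results[pred] for pred in dependencies.get(name, [])]
        results[name] = operators[name](*inputs)
    return results


# A toy three-operator workflow over one in-memory data source.
sources = {"raw": [3, 1, 2]}
operators = {
    "clean": sorted,
    "total": sum,
    "report": lambda values, total: {"values": values, "total": total},
}
dependencies = {
    "clean": ["raw"],
    "total": ["clean"],
    "report": ["clean", "total"],
}

results = run_workflow(operators, dependencies, sources)
print(results["report"])
```

The caller sees only the mapping of results. In that limited sense the function plays the role of a service that hides how and where the operators execute. A large-scale system would replace the loop with distributed scheduling, which this sketch does not attempt.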