It is now common to encounter communities engaged in the collaborative analysis and transformation of large quantities of data over extended time periods. We argue that these communities require a scalable system for managing, tracing, communicating, and exploring the derivation and analysis of diverse data objects. Such a system could bring significant productivity increases, facilitating discovery, understanding, assessment, and sharing of both data and transformation resources, as well as the productive use of distributed resources for computation, storage, and collaboration. We define a model and architecture for a virtual data grid to address this requirement. Using a broadly applicable “typed dataset” as the unit of derivation tracking, we introduce simple constructs for describing how datasets are derived from transformations and from other datasets. We also define mechanisms for integrating with, and adapting to, existing data management systems and transformation and anal...
Ian T. Foster