In high energy physics, bioinformatics, and other disciplines, we encounter applications involving numerous, loosely coupled jobs that both access and generate large data sets. So-called Data Grids seek to harness geographically distributed resources for such large-scale data-intensive problems. Yet effective scheduling in such environments is challenging, due to a need to address a variety of metrics and constraints (e.g., resource utilization, response time, global and local allocation policies) while dealing with multiple, potentially independent sources of jobs and a large number of storage, compute, and network resources. We describe a scheduling framework that addresses these problems. Within this framework, data movement operations may be either tightly bound to job scheduling decisions or, alternatively, performed by a decoupled, asynchronous process on the basis of observed data access patterns and load. We develop a family of job scheduling and data movement (replication) al...
Kavitha Ranganathan, Ian T. Foster