High-performance computing increasingly occurs on "computational grids" composed of heterogeneous and geographically distributed systems of computers, networks, and storage devices that collectively act as a single "virtual" computer. One of the great challenges for this environment is to provide efficient access to data that is distributed across remote data servers in a grid. In this paper, we describe our solution, a framework we call armada. The framework allows applications and dataset providers to flexibly compose graphs of processing modules that describe the distribution, application interfaces, and processing required of the dataset before computation. The armada runtime system then restructures the graph, and places the processing modules at appropriate hosts to reduce network traffic.