As Data Grids become more commonplace, large data sets are being replicated and distributed to multiple sites, leading to the problem of determining which replica can be accessed most efficiently. The answer to this question can depend on many factors, including physical characteristics of the resources and the load behavior on the CPUs, networks, and storage devices that are part of the end-to-end path linking possible sources and sinks. We develop a predictive framework that combines (1) integrated instrumentation that collects information about the end-to-end performance of past transfers, (2) predictors to estimate future transfer times, and (3) a data delivery infrastructure that provides users with access to both the raw data and our predictions. We evaluate the performance of our predictors by applying them to log data collected from a wide area testbed. These preliminary results provide insights into the effectiveness of using predictors in this situation.
Sudharshan Vazhkudai, Jennifer M. Schopf, Ian T. F