As distributed storage systems grow, the time between detecting an error and repairing it becomes significant. Systems built on shared servers face additional complexity because of the high rate of service outages and revocation. Maintaining high replica counts in this environment becomes costly in both the storage required and the bandwidth consumed by file copies. The storage challenge can thus be phrased as an attempt to operate inexpensively with respect to cost constraints such as disk utilization, network bandwidth consumption, and server CPU time. The GEMS (Grid Enabled Molecular Simulation) storage system provides a replicated, shared workspace for large-scale molecular dynamics simulations, and exemplifies these issues. The GEMS framework addresses this problem by accessing metadata, prioritizing observed faults, and repairing them in an intelligent manner. In this paper, we provide observations from the operation of GEMS an...
Justin M. Wozniak, Paul Brenner, Douglas Thain, Aa