As parallel file systems span ever larger numbers of nodes to provide the performance and scalability that modern cluster applications demand, the need for fault-tolerant, highly available file systems has arisen. Modern parallel file systems spanning tens, hundreds, or even thousands of servers require fault tolerance so that a single disk failure or server loss does not cause job failure or catastrophic data loss. Effective fault tolerance in a parallel file system must provide a high degree of data resiliency, strong consistency, and scalable performance. In this paper, we describe a data replication technique that meets the resiliency and consistency requirements of parallel file systems while providing scalable performance. We measure the performance of the proposed mechanism by implementing it in a popular parallel file system, PVFS.
Bradley W. Settlemyer, Walter B. Ligon III