Distributed storage systems must provide highly available access to data while maintaining high performance and scalability. Reliability is equally critical: the correctness and availability of data must be guaranteed. We have designed the Sigma cluster file system to address these goals by distributing data across multiple nodes and keeping parity across these nodes. With data spread across multiple nodes, however, ensuring the consistency of the data requires special techniques. In this paper, we describe fault-tolerant algorithms that maintain the consistency and reliability of the file system, covering both data and metadata. We show how these techniques guarantee data integrity and availability even under failure scenarios.
Jonathan D. Bright, John A. Chandy