A Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications

As computational clusters grow, their mean time to failure decreases. Checkpointing is typically used to limit the computation lost to a failure. Most checkpointing techniques, however, require central storage for the checkpoints, which severely limits their scalability. We propose a scalable replication-based MPI checkpointing facility built on LAM/MPI. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, SAN-based solutions, and a commercial parallel file system, and show that they do not scale, particularly beyond 64 CPUs. Using the NAS Parallel Benchmarks and the High Performance LINPACK benchmark in tests on up to 256 nodes, we demonstrate that checkpointing and replication can be achieved with much lower overhead than current techniques provide.
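
The abstract does not say how replicas are placed or transferred, so the following is only a rough sketch of the general idea of keeping checkpoint copies on peer nodes instead of central storage. The ring placement, the function names (replica_targets, replicate, recover), and the in-memory stores are all assumptions made for illustration, not the authors' LAM/MPI implementation, and the asynchronous transfer itself is not modeled.

```python
def replica_targets(rank, world_size, replicas):
    """Peers that hold copies of `rank`'s checkpoint.

    Assumption: simple ring placement (next `replicas` ranks).
    """
    if not 0 < replicas < world_size:
        raise ValueError("replicas must be between 1 and world_size - 1")
    return [(rank + offset) % world_size for offset in range(1, replicas + 1)]


def replicate(checkpoints, replicas):
    """Build each node's local store: node -> {owner rank: checkpoint bytes}.

    `checkpoints[i]` is the (hypothetical) checkpoint image of rank i.
    """
    world_size = len(checkpoints)
    stores = {node: {} for node in range(world_size)}
    for rank, data in enumerate(checkpoints):
        stores[rank][rank] = data  # each node keeps its own checkpoint
        for peer in replica_targets(rank, world_size, replicas):
            stores[peer][rank] = data  # copy held on a peer, no central store
    return stores


def recover(stores, failed, rank):
    """Return a surviving copy of `rank`'s checkpoint, or None if none is left."""
    for node, store in stores.items():
        if node not in failed and rank in store:
            return store[rank]
    return None
```

With four ranks and two replicas each, for example, the checkpoint of rank 1 is still recoverable after nodes 1 and 2 fail, because node 3 holds a copy; it is lost only if nodes 1, 2, and 3 all fail.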

John Paul Walters, Vipin Chaudhary

Computational Clusters Increase | Distributed And Parallel Computing | HIPC 2007 | Most Checkpointing Techniques | Scalable Replication-based Mpi |


Added	07 Jun 2010
Updated	07 Jun 2010
Type	Conference
Year	2007
Where	HIPC
Authors	John Paul Walters, Vipin Chaudhary
