Scalable Fault Tolerant MPI: Extending the Recovery Algorithm



Abstract. Fault Tolerant MPI (FT-MPI)[6] was designed as a solution that gives applications different methods of handling process failures beyond simple checkpoint-restart schemes. The initial implementation of FT-MPI included a robust, heavyweight system state recovery algorithm that was designed to manage the membership of MPI communicators during multiple failures. The algorithm and its implementation, although robust, were very conservative, and this affected scalability on both very large clusters and distributed systems. This paper details the FT-MPI recovery algorithm and our initial experiments with new recovery algorithms that are aimed at being both scalable and latency tolerant. Our conclusions show that using topology-aware collective communication together with distributed consensus algorithms produces the best results.

Graham E. Fagg, Thara Angskun, George Bosilca, Jelena Pjesivac-Grbovic, Jack Dongarra


Fault Tolerant MPI | PVM 2005 | Recovery Algorithms | State Recovery Algorithm


» A Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications

» FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI

» Improved message logging versus improved coordinated checkpointing for fault tolerant MPI

» Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols

» Hyper Butterfly Network: A Scalable Optimally Fault Tolerant Architecture

» High performance linpack benchmark: a fault tolerant implementation without checkpointing

» Scalable Fault-Tolerant Distributed Shared Memory

» PLDA: Parallel Latent Dirichlet Allocation for Large-Scale Applications

Post Info

Added	28 Jun 2010
Updated	28 Jun 2010
Type	Conference
Year	2005
Where	PVM
Authors	Graham E. Fagg, Thara Angskun, George Bosilca, Jelena Pjesivac-Grbovic, Jack Dongarra
