Improved message logging versus improved coordinated checkpointing for fault tolerant MPI



Fault tolerance is a major concern for critical high-performance applications that use the MPI library. Several protocols provide automatic and transparent fault detection and recovery for message-passing systems; they differ in their impact on application performance and in their capacity to tolerate a high fault rate. In a recent paper, we demonstrated that the main differences between pessimistic sender-based message logging and coordinated checkpointing are 1) the communication latency and 2) the performance penalty in case of faults. Pessimistic message logging increases latency because of additional blocking control messages. When faults occur at a high rate, coordinated checkpointing incurs a higher performance penalty than message logging because it puts more stress on the checkpoint server. In this paper, we extend this study to improved versions of the message logging and coordinated checkpointing protocols, which respectively reduce the latency overhead of pessimistic message logging a...



CLUSTER 2004 | Distributed And Parallel Computing | Message Logging | Message Logging Protocol | Pessimistic Message |


Post Info

Added	20 Aug 2010
Updated	20 Aug 2010
Type	Conference
Year	2004
Where	CLUSTER
Authors	Pierre Lemarinier, Aurelien Bouteiller, Thomas Hérault, Géraud Krawezik, Franck Cappello
