Sciweavers



IPPS
2007
IEEE


Implementing and Evaluating Automatic Checkpointing


Download www.cecs.uci.edu

As computer clusters continue to grow in size and popularity, fault tolerance is becoming crucial to ensuring high performance and reliability for applications. A checkpoint mechanism provides this by recovering a failed parallel application, rolling it back to a point in its execution before the failure occurred. In this work we present a mechanism that automatically manages checkpoint operations when failures occur. The mechanism periodically records the application's context, identifies failed nodes, and restarts the MPI processes on the remaining nodes, so the application can continue and reuse the computation already completed. We describe the numerous changes made to the LAM/MPI source code. Experiments with an application for recognizing DNA similarity showed that, despite the overhead of periodic checkpoints, the benefits can reach about 50% on a small cluster.
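The checkpoint, detect, and restart cycle described in the abstract can be illustrated with a minimal single-process sketch. This is not the authors' LAM/MPI implementation: the names `save_checkpoint`, `load_checkpoint`, `compute`, `supervise`, and `NodeFailure` are hypothetical, the "nodes" are just labels, and the failure is simulated by raising an exception at a chosen step.

```python
import pickle
import tempfile
from pathlib import Path


class NodeFailure(Exception):
    """Simulated failure of one node (hypothetical, for illustration only)."""

    def __init__(self, node):
        super().__init__(f"node {node} failed")
        self.node = node


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the target."""
    tmp = path.with_suffix(".tmp")
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    tmp.replace(path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not path.exists():
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def compute(total_steps, interval, ckpt, nodes, fail_at=None):
    """Sum 0..total_steps-1, saving the context every `interval` steps.

    fail_at=(step, node) simulates `node` failing just before `step` runs,
    but only while that node is still part of `nodes`.
    """
    state = load_checkpoint(ckpt) or {"step": 0, "acc": 0}
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at[0] and fail_at[1] in nodes:
            raise NodeFailure(fail_at[1])
        state["acc"] += step
        state["step"] = step + 1
        if state["step"] % interval == 0:
            save_checkpoint(ckpt, state)
    return state["acc"]


def supervise(total_steps, interval, ckpt, nodes, fail_at=None):
    """Rerun `compute` on the surviving nodes after each simulated failure."""
    nodes = list(nodes)
    restarts = 0
    while True:
        try:
            result = compute(total_steps, interval, ckpt, nodes, fail_at)
            return result, nodes, restarts
        except NodeFailure as failure:
            nodes.remove(failure.node)  # drop the failed node
            restarts += 1               # next attempt resumes from the checkpoint


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(supervise(10, 3, Path(d) / "ckpt.pkl", ["n1", "n2", "n3"], fail_at=(7, "n2")))
```

In this sketch the failure at step 7 loses only the work done since the checkpoint taken after step 5; the restart repeats step 6 instead of starting again from step 0, which is the trade-off between checkpoint overhead and saved computation that the abstract refers to.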

Antonio S. Martins, Ronaldo Augusto Lara Gonçalves


Checkpoint Mechanism | Distributed And Parallel Computing | Fault Tolerance | Identifies Failed Nodes | IPPS 2007


Related Content

» Recent advances in checkpoint/recovery systems

» Diagnostic Evaluation of Machine Translation Systems Using Automatically Constructed Lingu...

» Enabling user-driven Checkpointing strategies in Reverse-mode Automatic Differentiation

» An Adaptive Checkpointing Protocol to Bound Recovery Time with Message Logging

» DejaVu: Transparent User-Level Checkpointing, Migration, and Recovery for Distributed Systems

» Transparent Adaptive Library-Based Checkpointing for Master-Worker Style Parallelism

» Implementing fault-tolerance in real-time systems by automatic program transformations

» Coordinated Checkpoint versus Message Log for Fault Tolerant MPI

» Mementos: system support for long-running computation on RFID-scale devices

Post Info

Added	03 Jun 2010
Updated	03 Jun 2010
Type	Conference
Year	2007
Where	IPPS
Authors	Antonio S. Martins, Ronaldo Augusto Lara Gonçalves
