Automated application-level checkpointing of MPI programs


Download iss.ices.utexas.edu

Because of increasing hardware and software complexity, the running time of many computational science applications now exceeds the mean time to failure of high-performance computing platforms. Therefore, computational science applications need to tolerate hardware failures. In this paper, we focus on the stopping failure model, in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach. We present a suitable protocol and show how it can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results suggesting that the overhead of using our system can be small.
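To make the idea of application-level state saving concrete, here is a minimal single-process sketch in C. It is not the protocol or the precompiler output described in the abstract: it leaves out MPI entirely, so it shows neither coordination between processes nor the saving of MPI library state. The names `app_state`, `save_checkpoint`, `load_checkpoint`, and `run` are hypothetical, and the write-then-rename step assumes POSIX `rename` semantics (the new file atomically replaces the old one).

```c
#include <stdio.h>
#include <string.h>

#define N 8

/* Hypothetical application state: a step counter plus the data it updates. */
typedef struct {
    int step;
    double data[N];
} app_state;

/* Write the state to a temporary file, then rename it over the checkpoint,
   so a failure during the write never leaves a partial checkpoint behind.
   Returns 0 on success, nonzero on failure. */
int save_checkpoint(const char *path, const app_state *s) {
    char tmp[512];
    int len = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (len < 0 || (size_t)len >= sizeof tmp) return -1;

    FILE *f = fopen(tmp, "wb");
    if (!f) return -1;
    size_t written = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || written != 1) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path); /* assumption: POSIX replace-on-rename */
}

/* Restore the state from the checkpoint file. Returns 0 on success,
   nonzero if no usable checkpoint exists. */
int load_checkpoint(const char *path, app_state *s) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t got = fread(s, sizeof *s, 1, f);
    fclose(f);
    return got == 1 ? 0 : -1;
}

/* Advance the computation from s->step up to `until`, saving a checkpoint
   every `interval` steps. The update rule is a stand-in for real work. */
void run(const char *path, app_state *s, int until, int interval) {
    while (s->step < until) {
        for (int i = 0; i < N; i++) s->data[i] += (double)(i + 1);
        s->step++;
        if (s->step % interval == 0) save_checkpoint(path, s);
    }
}
```

On startup, a program built this way would call `load_checkpoint`, fall back to a zero-initialized `app_state` if that fails, and then call `run`; after a failure, the restarted process repeats only the steps taken since the last saved checkpoint. In a real MPI program the processes would additionally have to agree on a consistent point at which to save, which is the part the paper's protocol addresses.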

Greg Bronevetsky, Daniel Marques, Keshav Pingali, Paul Stodghill


Computational Science Applications | Distributed And Parallel Computing | Faulty Process Hangs | MPI Library State | PPOPP 2003 |


Added	05 Jul 2010
Updated	05 Jul 2010
Type	Conference
Year	2003
Where	PPOPP
Authors	Greg Bronevetsky, Daniel Marques, Keshav Pingali, Paul Stodghill

Sciweavers
