Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation

15 years 8 months ago


With the increasing number of processors in modern HPC (High Performance Computing) systems, two problems have emerged: scalability and fault tolerance. In our previous work, we extended the MPI specification's handling of fault tolerance by specifying a systematic framework for the recovery methods, communicator, message modes, etc. that define the behavior of MPI when an error occurs. These extensions not only specify how the implementation of the MPI library and RTE (Run Time Environment) handle failures at the system level, but also provide ordinary HPC application developers with various recovery choices with varying performance and cost. In this paper, we continue the work of extending MPI’s capability in this direction. First, we propose an MPI operation level checkpoint/rollback library to recover the user’s data. More importantly, we argue that the future generation programming model of a fault tolerant MPI application should be r...

Yuan Tang, Graham E. Fagg, Jack Dongarra
