As the number of processors in modern HPC (High Performance Computing) systems grows, two problems become pressing: scalability and fault tolerance. In our previous work, we extended the MPI specification's handling of fault tolerance by specifying a systematic framework of recovery methods, communicator and message modes, etc. that defines the behavior of MPI when an error occurs. These extensions not only specify how the MPI library implementation and the RTE (Run Time Environment) handle failures at the system level, but also give ordinary HPC application developers a range of recovery choices with varying performance and cost. In this paper, we continue extending MPI’s capability in this direction. First, we propose an MPI operation-level checkpoint/rollback library to recover the user’s data. More importantly, we argue that the future generation programming model of a fault tolerant MPI application should be r...
Yuan Tang, Graham E. Fagg, Jack Dongarra