Interconnect agnostic checkpoint/restart in Open MPI

Long-running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Message Passing Interface (MPI) level transparent checkpoint/restart fault tolerance is an appealing option to HPC application developers who do not wish to restructure their code. Historically, MPI implementations that provided this option have struggled to provide a full range of interconnect support, especially shared memory support. This paper presents a new approach for implementing checkpoint/restart coordination algorithms that allows the MPI implementation of checkpoint/restart to be interconnect agnostic. This approach allows an application to be checkpointed on one set of interconnects (e.g., InfiniBand and shared memory) and be restarted with a different set of interconnects (e.g., Myrinet and shared memory or Ethernet). By separating the network interconnect details from the checkpoint/restart coordination a...
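The separation the abstract describes can be pictured with a toy model. The sketch below is not Open MPI's implementation and does not reproduce the paper's algorithm. It assumes a simple drain-by-message-count rule, and every name in it (`Transport`, `Process`, `Job`, `checkpoint`, `restart`) is hypothetical. It only shows that when coordination looks at message counts above the transport layer, the saved state carries no interconnect details, so a restart is free to use a different interconnect.

```python
"""Toy model: checkpoint/restart coordination kept above the transport layer.

All class and method names are hypothetical and chosen for illustration only.
"""
from dataclasses import dataclass, field


class Transport:
    """Stand-in for an interconnect driver (e.g. "infiniband", "ethernet")."""

    def __init__(self, name):
        self.name = name
        self.in_flight = []  # messages handed to the wire but not yet delivered

    def send(self, dst, payload):
        self.in_flight.append((dst, payload))

    def deliver_all(self):
        """Hand back every in-flight message, in order, and clear the wire."""
        delivered, self.in_flight = self.in_flight, []
        return delivered


@dataclass
class Process:
    rank: int
    inbox: list = field(default_factory=list)
    sent: int = 0      # counted above the transport
    received: int = 0  # counted above the transport


class Job:
    def __init__(self, nprocs, transport):
        self.procs = [Process(r) for r in range(nprocs)]
        self.transport = transport

    def send(self, src, dst, payload):
        self.procs[src].sent += 1
        self.transport.send(dst, payload)

    def progress(self):
        for dst, payload in self.transport.deliver_all():
            self.procs[dst].inbox.append(payload)
            self.procs[dst].received += 1

    def checkpoint(self):
        # Assumed coordination rule: drain until every sent message has been
        # received, judged purely from the counters kept above the transport.
        while sum(p.sent for p in self.procs) != sum(p.received for p in self.procs):
            self.progress()
        # The snapshot holds process-level state only; nothing about the wire.
        return [(p.rank, list(p.inbox), p.sent, p.received) for p in self.procs]

    @classmethod
    def restart(cls, snapshot, transport):
        # Any transport can be supplied here, since the snapshot never saw one.
        job = cls(len(snapshot), transport)
        for proc, (_rank, inbox, sent, received) in zip(job.procs, snapshot):
            proc.inbox, proc.sent, proc.received = list(inbox), sent, received
        return job


job = Job(2, Transport("infiniband"))
job.send(0, 1, "hello")
snapshot = job.checkpoint()
restarted = Job.restart(snapshot, Transport("ethernet"))
```

In this toy run the message sent over the first transport is drained into rank 1's inbox before the snapshot is taken, and the restarted job carries that inbox while using a different transport object.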

Joshua Hursey, Timothy Mattox, Andrew Lumsdaine

Checkpoint/Restart Coordination Algorithm | Distributed Computing | HPC Applications | HPDC 2009 | Interconnect

Post Info

Added	21 May 2010
Updated	21 May 2010
Type	Conference
Year	2009
Where	HPDC
Authors	Joshua Hursey, Timothy Mattox, Andrew Lumsdaine
