FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI



As high-performance clusters continue to grow in size, the mean time between failures shrinks. Thus, fault tolerance and reliability are becoming challenging factors for application scalability. The traditional disk-based method of dealing with faults is to periodically checkpoint the state of the entire application to reliable storage and restart from the most recent checkpoint. Recovering the application from a fault involves restarting it (often manually) on all processors and having every processor read its data from disk. The restart can therefore take minutes after it has been initiated. Such a strategy also requires that the failed processor be replaced so that the number of processors is the same at checkpoint time and at recovery time. We present FTC-Charm++, a fault-tolerant runtime based on a scheme for fast and scalable in-memory checkpoint and restart. At restart, the program can continue to run on the remaining processors without perfor...
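
The idea of in-memory checkpointing with restart on the remaining processors can be illustrated with a small simulation. The abstract does not spell out the scheme's mechanics, so everything below is an assumption made for illustration: the `Processor` class, the ring-style "buddy" placement of the second checkpoint copy, and the round-robin redistribution of recovered objects are hypothetical names and choices, not the FTC-Charm++ API.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Simulated processor: live objects plus checkpoints kept in memory."""
    rank: int
    objects: dict = field(default_factory=dict)     # live state: object id -> value
    local_ckpt: dict = field(default_factory=dict)  # copy of this rank's own objects
    buddy_ckpt: dict = field(default_factory=dict)  # copy held for the previous rank


def checkpoint(procs):
    """Keep each rank's state in its own memory and in a buddy's memory."""
    ranks = sorted(procs)
    for i, rank in enumerate(ranks):
        # Assumption: the buddy is simply the next rank in a ring.
        buddy = procs[ranks[(i + 1) % len(ranks)]]
        snapshot = copy.deepcopy(procs[rank].objects)
        procs[rank].local_ckpt = snapshot
        buddy.buddy_ckpt = copy.deepcopy(snapshot)


def recover(procs, failed_rank):
    """Roll back to the last checkpoint and continue without the failed rank."""
    if len(procs) < 2:
        raise ValueError("need at least one surviving processor")
    ranks = sorted(procs)
    successor = ranks[(ranks.index(failed_rank) + 1) % len(ranks)]
    # The failed rank's memory is gone; its buddy still holds a copy.
    lost = copy.deepcopy(procs[successor].buddy_ckpt)
    del procs[failed_rank]

    # Survivors roll back to their own in-memory checkpoint (no disk reads).
    for proc in procs.values():
        proc.objects = copy.deepcopy(proc.local_ckpt)

    # Assumption: recovered objects are spread round-robin over the survivors.
    survivors = sorted(procs)
    for i, (obj_id, value) in enumerate(sorted(lost.items())):
        procs[survivors[i % len(survivors)]].objects[obj_id] = value

    # Re-checkpoint so the smaller processor set is protected again.
    checkpoint(procs)
```

In this sketch, three simulated processors checkpoint, one fails, and the application state as of the checkpoint is rebuilt on the remaining two without touching disk and without a replacement processor. A single checkpoint copy would not be enough, since the failed processor's memory is lost along with it; that is why the sketch keeps a second copy elsewhere.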

Gengbin Zheng, Lixia Shi, Laxmikant V. Kalé


CLUSTER 2004 | Distributed And Parallel Computing | Memory Footprint | Scalable Inmemory Checkpoint | Traditional Disk-based Method |



Added	20 Aug 2010
Updated	20 Aug 2010
Type	Conference
Year	2004
Where	CLUSTER
Authors	Gengbin Zheng, Lixia Shi, Laxmikant V. Kalé
