Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems

Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a common technique for mitigating the amount of work lost due to job failures, but its effectiveness under realistic circumstances has not been studied. In this paper, we analyze the system-level performance of periodic application checkpointing using parameters similar to those projected for BlueGene/L systems. Our results reflect simulations on a toroidal interconnect architecture, using a real job log from a machine similar to BlueGene/L, and with a real failure distribution from a large-scale cluster. Our simulation studies investigate the impact of parameters such as checkpoint overhead and checkpoint interval on a number of performance metrics, including bounded slowdown, system utilization, and total work lost. The results suggest that periodic checkpointing may not be an effective way to improve the average...
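To make the quantities named in the abstract concrete, the following is a minimal sketch of how checkpoint interval and checkpoint overhead could determine the work lost at a failure, alongside the commonly used definition of bounded slowdown. It is not the paper's simulator: the function names, the alternating work/checkpoint model, and the 10-second threshold are assumptions chosen for illustration.

```python
def work_lost(failure_time, interval, overhead):
    """Seconds of useful work that must be redone after a failure.

    Assumed model (not taken from the paper): the job alternates
    `interval` seconds of useful work with `overhead` seconds spent
    writing a checkpoint, and a checkpoint only counts once it has
    been written completely. `failure_time` is wall-clock seconds
    since the job started.
    """
    cycle = interval + overhead
    completed_cycles = int(failure_time // cycle)
    elapsed_in_cycle = failure_time - completed_cycles * cycle
    # A failure during the checkpoint write loses the whole interval.
    return min(elapsed_in_cycle, interval)


def bounded_slowdown(wait, run, threshold=10.0):
    """Bounded slowdown as commonly defined in job-scheduling studies.

    The threshold (assumed 10 s here) keeps very short jobs from
    dominating the average; the result is never below 1.
    """
    return max((wait + run) / max(run, threshold), 1.0)


if __name__ == "__main__":
    # Failure 30 s into the third work phase.
    print(work_lost(failure_time=250, interval=100, overhead=10))
    # Failure while the first checkpoint is still being written.
    print(work_lost(failure_time=105, interval=100, overhead=10))
    print(bounded_slowdown(wait=90, run=10))
```

Under this model a shorter interval reduces the work lost per failure but spends a larger share of wall-clock time on overhead, which is the trade-off the simulations in the paper examine.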

Adam J. Oliner, Ramendra K. Sahoo, José E. Moreira, Meeta Sharma Gupta

Distributed And Parallel Computing | Hardware Failures | IPPS 2005 | Job Failures | Periodic Application Checkpointing

» Lossless compression for large scale cluster logs

» Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems

» Coordinated Checkpoint versus Message Log for Fault Tolerant MPI

» Combining Partial Redundancy and Checkpointing for HPC

» Fault-tolerant stream processing using a distributed replicated file system

» Provisioning a Multi-tiered Data Staging Area for Extreme-Scale Machines

» A Locating-First Approach for Scalable Overlay Multicast

Post Info

Added	25 Jun 2010
Updated	25 Jun 2010
Type	Conference
Year	2005
Where	IPPS
Authors	Adam J. Oliner, Ramendra K. Sahoo, José E. Moreira, Meeta Sharma Gupta
