Evaluating cooperative checkpointing for supercomputing systems

Cooperative checkpointing, in which the system dynamically skips checkpoints requested by applications at runtime, can exploit system-level information to improve performance and reliability in the face of failures. We evaluate the applicability of cooperative checkpointing to large-scale systems through simulation studies considering real workloads, failure logs, and different network topologies. We consider two cooperative checkpointing algorithms: work-based cooperative checkpointing uses a heuristic based on the amount of unsaved work, and risk-based cooperative checkpointing leverages failure event prediction. Our results demonstrate that, compared to periodic checkpointing, risk-based checkpointing with event prediction accuracy as low as 10% is able to significantly improve system utilization and reduce average bounded slowdown by a factor of 9, without losing any additional work to failures. Similarly, work-based checkpointing conferred tremendous performance benefits in the f...
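To make the skip decision described in the abstract concrete, here is a minimal sketch of how a system might grant or skip an application's checkpoint request. It is an illustration under stated assumptions, not the algorithms evaluated in the paper: the names `CheckpointRequest`, `work_based_grant`, and `risk_based_grant`, the field names, and both threshold rules are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckpointRequest:
    """State when an application asks to checkpoint (all fields hypothetical)."""
    unsaved_work_s: float     # compute time since the last completed checkpoint
    checkpoint_cost_s: float  # estimated time needed to write this checkpoint
    failure_risk: float       # predicted probability of failure before the next request


def work_based_grant(req: CheckpointRequest, min_unsaved_factor: float = 1.0) -> bool:
    """Assumed work-based rule: perform the checkpoint only if the unsaved
    work is at least min_unsaved_factor times the cost of saving it."""
    return req.unsaved_work_s >= min_unsaved_factor * req.checkpoint_cost_s


def risk_based_grant(req: CheckpointRequest) -> bool:
    """Assumed risk-based rule: perform the checkpoint only if the expected
    loss from skipping (risk * unsaved work) is at least the checkpoint cost."""
    return req.failure_risk * req.unsaved_work_s >= req.checkpoint_cost_s


# Little unsaved work: both rules skip the request.
small = CheckpointRequest(unsaved_work_s=600, checkpoint_cost_s=720, failure_risk=0.05)
# An hour of unsaved work with low predicted risk: the rules disagree.
calm = CheckpointRequest(unsaved_work_s=3600, checkpoint_cost_s=720, failure_risk=0.1)
# An hour of unsaved work with high predicted risk: both rules checkpoint.
risky = CheckpointRequest(unsaved_work_s=3600, checkpoint_cost_s=720, failure_risk=0.5)

for name, req in [("small", small), ("calm", calm), ("risky", risky)]:
    print(name, work_based_grant(req), risk_based_grant(req))
```

In this sketch the work-based rule needs only information the system already tracks, while the risk-based rule depends on a failure predictor supplying `failure_risk`; how such a predictor is built, and which thresholds the paper actually uses, are outside what the abstract states.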

Adam J. Oliner, Ramendra K. Sahoo

Cooperative Checkpointing | Distributed And Parallel Computing | IPPS 2006 | Risk-based Cooperative Checkpointing | Work-based Cooperative Checkpointing

Added	12 Jun 2010
Updated	12 Jun 2010
Type	Conference
Year	2006
Where	IPPS
Authors	Adam J. Oliner, Ramendra K. Sahoo
