Hybrid Checkpointing for MPI Jobs in HPC Environments

As core counts in high-performance computing systems keep increasing, faults are becoming commonplace. Checkpointing addresses such faults but captures full process images, even though only a subset of the process image changes between checkpoints. We have designed a hybrid checkpointing technique for MPI tasks of high-performance applications. This technique alternates between full and incremental checkpoints: at incremental checkpoints, only data changed since the last checkpoint is captured. Our implementation integrates new BLCR and LAM/MPI features that complement traditional full checkpoints. This results in significantly reduced checkpoint sizes and overheads with only moderate increases in restart overhead. After accounting for costs and savings, the benefits of incremental checkpoints are an order of magnitude larger than the overheads on restarts. We further derive qualitative results indicating an optimal balance between full/incremental checkpoints of our novel approach at ...
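The alternation between full and incremental checkpoints described in the abstract can be illustrated with a minimal sketch. This is not the authors' BLCR and LAM/MPI implementation: the class `HybridCheckpointer`, the parameter `incrementals_per_full`, the tiny `PAGE_SIZE`, and the detection of changed pages by comparing contents are all assumptions made for illustration only. The sketch also assumes the memory image keeps a constant size.

```python
from typing import Dict, List, Tuple

PAGE_SIZE = 4  # bytes per "page"; deliberately tiny so the example stays readable


def split_pages(memory: bytes) -> Dict[int, bytes]:
    """Split a memory image into fixed-size pages keyed by page index."""
    return {i // PAGE_SIZE: memory[i:i + PAGE_SIZE]
            for i in range(0, len(memory), PAGE_SIZE)}


class HybridCheckpointer:
    """Hypothetical helper: one full checkpoint, then N incremental ones, repeated."""

    def __init__(self, incrementals_per_full: int) -> None:
        self.incrementals_per_full = incrementals_per_full
        # Chain of (kind, saved pages) starting at the most recent full checkpoint.
        self.chain: List[Tuple[str, Dict[int, bytes]]] = []
        self._last_pages: Dict[int, bytes] = {}
        self._since_full = 0

    def checkpoint(self, memory: bytes) -> str:
        """Save a full or incremental checkpoint and return which kind was taken."""
        pages = split_pages(memory)
        need_full = not self.chain or self._since_full >= self.incrementals_per_full
        if need_full:
            # Full checkpoint: capture every page and start a new chain.
            self.chain = [("full", pages)]
            self._since_full = 0
            kind = "full"
        else:
            # Incremental checkpoint: capture only pages changed since the last
            # checkpoint. Assumption: changes are found by content comparison.
            changed = {i: p for i, p in pages.items()
                       if self._last_pages.get(i) != p}
            self.chain.append(("incremental", changed))
            self._since_full += 1
            kind = "incremental"
        self._last_pages = pages
        return kind

    def restart(self) -> bytes:
        """Rebuild the image: apply the full checkpoint, then each increment in order."""
        pages: Dict[int, bytes] = {}
        for _, saved in self.chain:
            pages.update(saved)
        return b"".join(pages[i] for i in sorted(pages))
```

In this sketch, a restart has to replay every incremental checkpoint on top of the last full one, which mirrors the trade-off the abstract mentions: smaller checkpoints in exchange for a moderately more expensive restart.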

Checkpointing | Distributed And Parallel Computing | Hybrid Checkpointing Technique | ICPADS 2010 | Incremental Checkpoints |

Post Info

Added	12 Feb 2011
Updated	12 Feb 2011
Type	Conference
Year	2010
Where	ICPADS
Authors	Chao Wang, Frank Mueller, Christian Engelmann, Stephen L. Scott
