BAD-check: bulk asynchronous distributed checkpointing

Leadership-scale scientific simulations running as tens of thousands of tightly-coupled MPI processes are vulnerable to interruption due to a single process or node failure. Due to the dependence of each state calculation on the successful completion of each of the prior state calculations, checkpoint-restart is the most widely-used technique to achieve fault tolerance. To write a consistent view of distributed state as a checkpoint, applications typically synchronize and pause while writing data to persistent media. In this paper we present a transactional protocol that enables asynchronous distributed creation of checkpoint data sets, and describe the conditions under which it is beneficial. With simulations, we demonstrate that scientific applications exhibiting computational variance without frequent synchronization can use our protocol to either reduce run time by up to 27% or reduce required storage system capability by up to 40%.
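The intuition behind the abstract's claim about computational variance can be illustrated with a toy timing model. The sketch below is not the paper's protocol or its simulator; it only compares a barrier-synchronized checkpoint schedule with one in which each process writes its checkpoint as soon as it reaches the checkpoint step. The function name `simulate`, the uniform compute-time distribution, the fixed per-checkpoint `write_time`, and the assumption that processes never otherwise synchronize or contend for storage are all hypothetical choices made for illustration.

```python
import random


def simulate(num_procs=64, steps=100, ckpt_every=10, write_time=2.0, seed=0):
    """Toy model: return (sync_runtime, async_runtime) in arbitrary time units."""
    rng = random.Random(seed)

    # Per-process, per-step compute times with variance (assumed distribution).
    compute = [
        [rng.uniform(0.5, 1.5) for _ in range(steps)]
        for _ in range(num_procs)
    ]
    interval_starts = range(0, steps, ckpt_every)

    # Synchronous: at every checkpoint all processes wait at a barrier for the
    # slowest one, then all write before anyone resumes computing.
    sync_runtime = 0.0
    for start in interval_starts:
        slowest = max(sum(row[start:start + ckpt_every]) for row in compute)
        sync_runtime += slowest + write_time

    # Asynchronous: each process writes when it reaches its own checkpoint
    # step and continues; the run ends when the slowest process finishes.
    num_ckpts = len(interval_starts)
    async_runtime = max(sum(row) + num_ckpts * write_time for row in compute)

    return sync_runtime, async_runtime


if __name__ == "__main__":
    sync_t, async_t = simulate()
    print(f"synchronous: {sync_t:.1f}  asynchronous: {async_t:.1f}")
```

In this model the asynchronous run time can never exceed the synchronous one, because the maximum of per-process sums is at most the sum of per-interval maxima. The size of the gap depends entirely on the assumed variance and checkpoint interval, so the numbers it prints say nothing about the 27% and 40% figures reported in the abstract.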

John Bent, Brad Settlemyer, Haiyun Bao, Sorin Faibish, Jeremy Sauer, Jingwang Zhang

Applied Computing | SC 2015 |

Added	17 Apr 2016
Updated	17 Apr 2016
Type	Journal
Year	2015
Where	SC
Authors	John Bent, Brad Settlemyer, Haiyun Bao, Sorin Faibish, Jeremy Sauer, Jingwang Zhang
