Algorithm-based recovery for iterative methods without checkpointing

13 years 10 months ago


In today’s high-performance computing practice, fail-stop failures are often tolerated by checkpointing. Although checkpointing is a very general technique that applies to a wide range of applications, it introduces considerable overhead, especially when computations reach petascale and beyond. In this paper, we show that, for many iterative methods, if the parallel data partitioning scheme satisfies certain conditions, the iterative methods themselves maintain enough inherent redundant information to accurately recover the lost data without checkpointing. We analyze the block row data partitioning scheme for sparse matrices and derive a sufficient condition for recovering the critical data without checkpointing. When this sufficient condition is satisfied, neither checkpointing nor rollback is necessary for the recovery. Furthermore, the fault tolerance overhead (time) is zero if no failures occur during a program execution. Overhead is introduced ...
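The idea of inherent redundancy under block row partitioning can be illustrated with a small sketch. It rests on an assumed reading of the condition, not on the paper's exact statement: during a sparse matrix-vector product, a process receives every vector entry whose column index appears in one of its rows, so an entry owned by a failed process survives if some row on another process references it. The names `tridiagonal` and `recoverable_entries` and the dict-of-rows matrix layout are hypothetical and chosen only for illustration.

```python
def tridiagonal(n):
    """Build an n x n tridiagonal test matrix as {row: {col: value}}."""
    return {
        i: {j: (2.0 if i == j else -1.0) for j in (i - 1, i, i + 1) if 0 <= j < n}
        for i in range(n)
    }


def recoverable_entries(rows, owner, failed):
    """Split the failed process's vector indices into (replicated, missing).

    rows   -- sparse matrix as {row: {col: value}}
    owner  -- owner[i] is the process holding row i and vector entry i
    failed -- id of the process whose data was lost

    Assumption: a surviving process holds a copy of x[j] exactly when one
    of its rows has a nonzero in column j (it needed x[j] for the mat-vec).
    """
    owned = {i for i, p in enumerate(owner) if p == failed}
    replicated = set()
    for i, cols in rows.items():
        if owner[i] != failed:
            replicated.update(j for j, v in cols.items() if v != 0 and j in owned)
    return replicated, owned - replicated


# Six rows split over three processes, two rows each.
A = tridiagonal(6)
owner = [0, 0, 1, 1, 2, 2]
middle = recoverable_entries(A, owner, failed=1)  # both entries have copies elsewhere
edge = recoverable_entries(A, owner, failed=0)    # entry 0 is referenced only locally
```

In this toy case the middle process could be rebuilt entirely from its neighbors, while the first process owns an entry that no other process ever received, so the assumed condition fails for it. Nothing is stored beyond what the mat-vec already communicates, which is consistent with the abstract's claim of zero overhead when no failure occurs.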

Zizhong Chen
