High performance linpack benchmark: a fault tolerant implementation without checkpointing

13 years 8 months ago


The probability that a failure will occur before the end of the computation increases as the number of processors used in a high performance computing application increases. For long running applications using a large number of processors, it is essential that fault tolerance be used to prevent a total loss of all finished computations after a failure. While checkpointing has been very useful to tolerate failures for a long time, it often introduces a considerable overhead especially when applications modify a large amount of memory between checkpoints and the number of processors is large. In this paper, we propose an algorithm-based recovery scheme for the High Performance Linpack benchmark (which modifies a large amount of memory in each iteration) to tolerate fail-stop failures without checkpointing. It was proved by Huang and Abraham that a checksum added to a matrix will be maintained after the matrix is factored. We demonstrate that, for the right-looking LU factorization alg...
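The checksum property mentioned in the abstract can be illustrated with a small sketch. This is not the paper's distributed scheme: it assumes a tiny dense matrix, no pivoting, nonzero pivots, and a single row-sum checksum column, and all function names are hypothetical. It only shows that elimination steps applied to the checksum column keep each row's checksum valid, so a lost entry can be rebuilt from the survivors.

```python
from fractions import Fraction


def with_row_checksum(a):
    """Copy matrix `a` and append one column holding each row's sum."""
    return [[Fraction(x) for x in row] + [Fraction(sum(row))] for row in a]


def eliminate(aug):
    """Right-looking Gaussian elimination without pivoting, in place.

    The row updates are applied to the checksum column as well.
    Assumption: every pivot is nonzero.
    """
    n = len(aug)
    for k in range(n):
        for i in range(k + 1, n):
            factor = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):  # n + 1 includes the checksum column
                aug[i][j] -= factor * aug[k][j]
    return aug


def checksum_holds(aug):
    """True if every row's last entry equals the sum of its other entries."""
    return all(sum(row[:-1]) == row[-1] for row in aug)


def recover_entry(aug, i, j):
    """Rebuild entry (i, j) from the row checksum and the surviving entries."""
    others = sum(v for c, v in enumerate(aug[i][:-1]) if c != j)
    return aug[i][-1] - others


a = [[4, 3, 2], [2, 1, 3], [3, 4, 1]]
aug = eliminate(with_row_checksum(a))
```

Exact rational arithmetic is used here so the checksum comparison is an equality test; with floating point, a tolerance would be needed instead.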

Teresa Davies, Christer Karlsson, Hui Liu, Chong Ding, Zizhong Chen


Distributed And Parallel Computing | Fault Tolerance | ICS 2011 | Recovery Scheme | Supercomputer Jaguar


» Transparent Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for...

» Fault tolerant high performance computing by a coding approach

» MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI

» Improved message logging versus improved coordinated checkpointing for fault tolerant MPI

» Algorithm-based recovery for iterative methods without checkpointing

» Combining Partial Redundancy and Checkpointing for HPC

» The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI

» FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI

Post Info

Added	29 Aug 2011
Updated	29 Aug 2011
Type	Journal
Year	2011
Where	ICS
Authors	Teresa Davies, Christer Karlsson, Hui Liu, Chong Ding, Zizhong Chen


Sciweavers
