Algorithmic Based Fault Tolerance Applied to High Performance Computing


Download www.netlib.org

Abstract: We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault-tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our...
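The checksum idea behind Algorithmic Based Fault Tolerance can be illustrated with a minimal serial sketch. It is not the authors' parallel subroutine: the function names are hypothetical, the matrices are small integer lists so that checksum comparisons are exact (floating-point data would need a tolerance), and only a single corrupted data entry is handled. The first operand gets an extra row of column sums and the second an extra column of row sums, so their product carries checksums that locate and repair a flipped entry.

```python
def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def checksum_matmul(a, b):
    """Return the product C extended with a checksum row and checksum column."""
    return matmul(add_column_checksum(a), add_row_checksum(b))


def detect_and_correct(cf):
    """Locate and repair a single corrupted data entry of cf in place.

    Returns (i, j) of the repaired entry, or None if all checksums agree.
    Assumption: at most one data entry is wrong and the checksums are intact.
    """
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern is not a single corrupted data entry")
    i, j = bad_rows[0], bad_cols[0]
    # The row checksum minus the current row sum is the negated error.
    cf[i][j] += cf[i][m] - sum(cf[i][:m])
    return i, j


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
cf = checksum_matmul(a, b)
cf[0][1] ^= 1 << 4                # simulate a bit-flip in one result entry
print(detect_and_correct(cf))     # position of the repaired entry
print([row[:2] for row in cf[:2]])  # recovered product without checksums
```

A failing row checksum and a failing column checksum intersect at the corrupted entry, which is why a single error can be both located and corrected without recomputing the product.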

George Bosilca, Remi Delmas, Jack Dongarra, Julien Langou


CORR 2008 | Education | Fault Tolerance | Fault Tolerance Technique | Parallel Distributed Computation


» Transparent Incremental Checkpointing at Kernel Level a Foundation for Fault Tolerance for...

» High performance linpack benchmark a fault tolerant implementation without checkpointing

» Fault tolerant high performance computing by a coding approach

» A Fault-Tolerant Middleware Architecture for High-Availability Storage Services

» A QoS-aware fault tolerant middleware for dependable service composition

» Comparison of Failure Detectors and Group Membership Performance Study of Two Atomic Broad...

» Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile...

» Fault tolerant clockless wave pipeline design

Post Info

Added	09 Dec 2010
Updated	09 Dec 2010
Type	Journal
Year	2008
Where	CORR
Authors	George Bosilca, Remi Delmas, Jack Dongarra, Julien Langou


Sciweavers
