Enhancing application robustness through adaptive fault tolerance

As the scale of high-performance computing (HPC) continues to grow, application fault resilience becomes crucial. To address this problem, we are working on the design of an adaptive fault tolerance system for HPC applications. It aims to enable parallel applications to avoid anticipated failures via preventive migration and, in the case of unforeseeable failures, to minimize their impact through selective checkpointing. This paper summarizes both prior and ongoing work.

Zhiling Lan, Yawei Li, Ziming Zheng, Prashasta Gujrati

Adaptive Fault Tolerance | Application Fault Resilience | Distributed and Parallel Computing | HPC Applications | IPPS 2008

» ROAFTS: A Middleware Architecture for Real-Time Object-Oriented Adaptive Fault Tolerance Supp...

» Providing Fault-Tolerance in Unreliable Grid Systems Through Adaptive Checkpointing and Rep...

» CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems

» Formal Modelling and Analysis of Business Information Applications with Fault Tolerant Mid...

» ORTEGA: An Efficient and Flexible Software Fault Tolerance Architecture for Real-Time Contro...

» Tamper-Tolerant Software: Modeling and Implementation

» Formal Development of Reactive Fault Tolerant Systems

» Fault Tolerant Planning for Critical Robots

Post Info

Added	31 May 2010
Updated	31 May 2010
Type	Conference
Year	2008
Where	IPPS
Authors	Zhiling Lan, Yawei Li, Ziming Zheng, Prashasta Gujrati
