IPPS
2008
IEEE

Enhancing application robustness through adaptive fault tolerance

As the scale of high performance computing (HPC) continues to grow, application fault resilience becomes crucial. To address this problem, we are working on the design of an adaptive fault tolerance system for HPC applications. It aims to enable parallel applications to avoid anticipated failures via preventive migration, and in the case of unforeseeable failures, to minimize their impact through selective checkpointing. Both prior and ongoing work are summarized in this paper.
Added: 31 May 2010
Updated: 31 May 2010
Type: Conference
Year: 2008
Where: IPPS
Authors: Zhiling Lan, Yawei Li, Ziming Zheng, Prashasta Gujrati