Sciweavers

IPPS
2005
IEEE

Fault-Tolerant Parallel Applications with Dynamic Parallel Schedules

Commodity computer clusters are often composed of hundreds of computing nodes. These generally off-the-shelf systems are not designed for high reliability, so node failures drive the MTBF of such clusters to unacceptable levels. The software frameworks used for running parallel applications need to be fault-tolerant in order to ensure continued execution despite node failures. We propose an extension to the flow-graph-based Dynamic Parallel Schedules (DPS) development framework that allows non-trivial parallel applications to continue executing despite node failures. The proposed fault-tolerance mechanism relies on a set of backup threads located in the volatile storage of alternate nodes. These backup threads are kept up to date by duplicating the transmitted data objects and periodically checkpointing thread states. When a node fails, the current state of the threads that were on the failed node is reconstructed on the backup threads by re-executing operations. The...
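The recovery idea described in the abstract (duplicated data objects, periodic checkpoints, and re-execution on a backup) can be illustrated with a minimal single-process sketch. This is not the DPS API: the class names `WorkerThread` and `BackupThread`, the `run` helper, and the running-sum state are all hypothetical, and the sketch assumes that operations are deterministic, so replaying the same data objects from a checkpoint reproduces the lost state.

```python
import copy


class WorkerThread:
    """Hypothetical stand-in for a thread whose state is updated per data object."""

    def __init__(self):
        self.state = {"count": 0, "total": 0}

    def process(self, data_object):
        # Deterministic operation: the same inputs always yield the same state.
        self.state["count"] += 1
        self.state["total"] += data_object


class BackupThread:
    """Hypothetical backup kept in the memory of an alternate node."""

    def __init__(self):
        self.checkpoint = {"count": 0, "total": 0}
        self.pending = []  # data objects received since the last checkpoint

    def record(self, data_object):
        # Duplicate of a data object that was also sent to the primary.
        self.pending.append(data_object)

    def store_checkpoint(self, state):
        # A checkpoint supersedes the objects logged before it.
        self.checkpoint = copy.deepcopy(state)
        self.pending.clear()

    def recover(self):
        # Rebuild the lost state: restore the checkpoint, then re-execute.
        worker = WorkerThread()
        worker.state = copy.deepcopy(self.checkpoint)
        for data_object in self.pending:
            worker.process(data_object)
        return worker


def run(data_objects, checkpoint_every, fail_after=None):
    """Process all objects, optionally simulating a failure of the primary."""
    primary, backup = WorkerThread(), BackupThread()
    for i, data_object in enumerate(data_objects, start=1):
        backup.record(data_object)
        primary.process(data_object)
        if i % checkpoint_every == 0:
            backup.store_checkpoint(primary.state)
        if i == fail_after:
            # The primary's node is lost; continue from the reconstructed state.
            primary = backup.recover()
    return primary


if __name__ == "__main__":
    objects = list(range(1, 11))
    clean = run(objects, checkpoint_every=3)
    recovered = run(objects, checkpoint_every=3, fail_after=5)
    print(clean.state == recovered.state)
```

In this sketch the run with a simulated failure ends in the same state as the failure-free run, because the backup replays exactly the objects received after its last checkpoint. A real distributed implementation would also have to handle failure detection, message ordering, and failures of the backup itself, none of which are modeled here.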
Sebastian Gerlach, Roger D. Hersch
Added 25 Jun 2010
Updated 25 Jun 2010
Type Conference
Year 2005
Where IPPS
Authors Sebastian Gerlach, Roger D. Hersch