Fault-Tolerant Parallel Applications with Dynamic Parallel Schedules

Download dps.epfl.ch

Commodity computer clusters are often composed of hundreds of computing nodes. These systems are generally built from off-the-shelf components and are not designed for high reliability. Node failures therefore drive the mean time between failures (MTBF) of such clusters to unacceptable levels. The software frameworks used for running parallel applications need to be fault-tolerant in order to ensure continued execution despite node failures. We propose an extension to the flow-graph-based Dynamic Parallel Schedules (DPS) development framework that allows non-trivial parallel applications to continue executing despite node failures. The proposed fault-tolerance mechanism relies on a set of backup threads located in the volatile storage of alternate nodes. These backup threads are kept up to date by duplicating the transmitted data objects and periodically checkpointing thread states. When a node fails, the current state of the threads that were on the failed node is reconstructed on the backup threads by re-executing operations. The...
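
The recovery idea in the abstract (duplicate each transmitted data object to a backup, checkpoint the thread state periodically, and rebuild the lost state by re-executing operations) can be illustrated with a minimal single-process sketch. This is not the DPS API: the names BackupThread, PrimaryThread, checkpoint_every and add are hypothetical, the "nodes" are plain Python objects, and the sketch assumes the operation is deterministic so that replaying the logged objects reproduces the same state.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class BackupThread:
    """Hypothetical backup held in the memory of an alternate node."""
    checkpoint: Any                               # last checkpointed state
    log: List[Any] = field(default_factory=list)  # objects duplicated since then

    def receive_duplicate(self, obj: Any) -> None:
        # Keep a copy of every data object sent to the primary thread.
        self.log.append(copy.deepcopy(obj))

    def receive_checkpoint(self, state: Any) -> None:
        # A new checkpoint makes the older logged objects unnecessary.
        self.checkpoint = copy.deepcopy(state)
        self.log.clear()

    def recover(self, operation: Callable[[Any, Any], Any]) -> Any:
        # Rebuild the lost state: start from the checkpoint and re-execute
        # the operation on each logged object, in arrival order.
        state = copy.deepcopy(self.checkpoint)
        for obj in self.log:
            state = operation(state, obj)
        return state


class PrimaryThread:
    """Hypothetical worker whose state must survive a node failure."""

    def __init__(self, state: Any, operation: Callable[[Any, Any], Any],
                 backup: BackupThread, checkpoint_every: int = 3) -> None:
        self.state = state
        self.operation = operation
        self.backup = backup
        self.checkpoint_every = checkpoint_every
        self.processed = 0

    def process(self, obj: Any) -> None:
        self.backup.receive_duplicate(obj)          # duplicate first
        self.state = self.operation(self.state, obj)
        self.processed += 1
        if self.processed % self.checkpoint_every == 0:
            self.backup.receive_checkpoint(self.state)  # periodic checkpoint


def add(state: int, obj: int) -> int:
    """Deterministic example operation: a running sum."""
    return state + obj


if __name__ == "__main__":
    backup = BackupThread(checkpoint=0)
    primary = PrimaryThread(0, add, backup, checkpoint_every=3)
    for value in [1, 2, 3, 4, 5]:
        primary.process(value)
    # Simulate losing the primary: only the backup remains.
    print(backup.recover(add))
```

In this sketch the checkpoint interval trades memory and recovery time against checkpoint traffic: a shorter interval keeps the log small and recovery fast, while a longer interval sends fewer state copies but requires more re-execution after a failure.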

Sebastian Gerlach, Roger D. Hersch

Backup Thread | Distributed And Parallel Computing | Execution Despite Node | IPPS 2005 | Node Failures |

Post Info

Added	25 Jun 2010
Updated	25 Jun 2010
Type	Conference
Year	2005
Where	IPPS
Authors	Sebastian Gerlach, Roger D. Hersch
