Sciweavers

EUROPAR
2007
Springer

Persistent Fault-Tolerance for Divide-and-Conquer Applications on the Grid

Grid applications need to be fault tolerant, malleable, and migratable. In previous work, we presented orphan saving, an efficient mechanism that addresses these issues for divide-and-conquer applications. In this paper, we present a mechanism for writing partial results to checkpoint files, which adds the ability to tolerate the total loss of all processors and to suspend and later resume an application. Both mechanisms have only negligible overhead in the absence of faults. In the case of faults, the new checkpointing mechanism outperforms orphan saving by 10% to 15%. Suspending and resuming an application also incurs little overhead, making our approach very attractive for writing grid applications.
Gosia Wrzesinska, Ana-Maria Oprescu, Thilo Kielmann, Henri E. Bal
Added 07 Jun 2010
Updated 07 Jun 2010
Type Conference
Year 2007
Where EUROPAR
Authors Gosia Wrzesinska, Ana-Maria Oprescu, Thilo Kielmann, Henri E. Bal