Sciweavers

» On Coordinated Checkpointing in Distributed Systems
HIPC
2007
Springer
14 years 2 months ago
A Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications
As computational clusters increase in size, their mean-time-to-failure reduces. Typically checkpointing is used to minimize the loss of computation. Most checkpointing techniques, ...
John Paul Walters, Vipin Chaudhary
IPPS
2009
IEEE
14 years 3 months ago
DMTCP: Transparent checkpointing for cluster computations and the desktop
DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart is demonstrated for a wid...
Jason Ansel, Kapil Arya, Gene Cooperman
AP2PS
2009
IEEE
14 years 3 days ago
Algorithm-Based Fault Tolerance Applied to P2P Computing Networks
P2P computing platforms are subject to a wide range of attacks. In this paper, we propose a generalisation of the previous disk-less checkpointing approach for fault-tolerance i...
Thomas Roche, Mathieu Cunche, Jean-Louis Roch
IPPS
1996
IEEE
14 years 28 days ago
CoCheck: Checkpointing and Process Migration for MPI
Checkpointing of parallel applications can be used as the core technology to provide process migration. Both checkpointing and migration are important issues for parallel appl...
Georg Stellner
DSN
2005
IEEE
14 years 2 months ago
Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems
G. John Janakiraman, Jose Renato Santos, Dinesh Su...