Sciweavers

1256 search results - page 18 / 252
» On Coordinated Checkpointing in Distributed Systems
SRDS
2003
IEEE
14 years 2 months ago
Raptor: Integrating Checkpoints and Thread Migration for Cluster Management
Software distributed shared-memory (SDSM) provides the abstraction necessary to run shared-memory applications on cost-effective parallel platforms such as clusters of workstations. Howeve...
Hazim Shafi, Evan Speight, John K. Bennett
GI
2004
Springer
14 years 2 months ago
Crash Management for Distributed Parallel Systems
With the growing complexity of parallel architectures, the probability of system failures grows, too. One approach to cope with this problem is the self-healing, one of the organ...
Jan Haase, Frank Eschmann
PVM
2005
Springer
14 years 2 months ago
Cooperative Write-Behind Data Buffering for MPI I/O
Many large-scale production parallel programs often run for a very long time and require data checkpoint periodically to save the state of the computation for program restart and/o...
Wei-keng Liao, Kenin Coloma, Alok N. Choudhary, Le...
SOSP
2005
ACM
14 years 5 months ago
Speculative execution in a distributed file system
Speculator provides Linux kernel support for speculative execution. It allows multiple processes to share speculative state by tracking causal dependencies propagated through inte...
Edmund B. Nightingale, Peter M. Chen, Jason Flinn
IPPS
2007
IEEE
14 years 3 months ago
A Fault Tolerance Protocol with Fast Fault Recovery
Fault tolerance is an important issue for large machines with tens or hundreds of thousands of processors. Checkpoint-based methods, currently used on most machines, rollback all ...
Sayantan Chakravorty, Laxmikant V. Kalé