Sciweavers

» On Coordinated Checkpointing in Distributed Systems
SRDS
1999
IEEE
14 years 1 month ago
An Adaptive Checkpointing Protocol to Bound Recovery Time with Message Logging
Numerous mathematical approaches have been proposed to determine the optimal checkpoint interval for minimizing total execution time of an application in the presence of failures....
Kuo-Feng Ssu, Bin Yao, W. Kent Fuchs
SRDS
1999
IEEE
14 years 1 month ago
Logging and Recovery in Adaptive Software Distributed Shared Memory Systems
Software distributed shared memory (DSM) improves the programmability of message-passing machines and workstation clusters by providing a shared memory abstraction (i.e., a coherent global a...
Angkul Kongmunvattana, Nian-Feng Tzeng
ICPP
2009
IEEE
14 years 3 months ago
Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems
Clusters and applications continue to grow in size while their mean time between failures (MTBF) is getting smaller. Checkpoint/Restart is becoming increasingly important for lar...
Xiangyong Ouyang, Karthik Gopalakrishnan, Dhabales...
ICECCS
1997
IEEE
14 years 29 days ago
Cache based fault recovery for distributed systems
No cache-based techniques for roll-forward fault recovery exist at present. A split-cache approach is proposed that provides efficient support for checkpointing and roll-forward fau...
Avi Mendelson, Neeraj Suri
GRID
2004
Springer
14 years 2 months ago
Checkpoint and Restart for Distributed Components in XCAT3
With the advent of Grid computing, more and more high-end computational resources are becoming available to scientists. While this opens up new avenues for scientific research,...
Sriram Krishnan, Dennis Gannon