In today’s high performance computing practice, fail-stop failures are often tolerated by checkpointing. While checkpointing is a very general technique and can often be applied...
Current petascale systems have tens of thousands of hardware components and complex system software stacks, which increase the probability of faults occurring during the lifetime ...
As the core count in high-performance computing systems keeps increasing, faults are becoming common place. Checkpointing addresses such faults but captures full process images ev...
Chao Wang, Frank Mueller, Christian Engelmann, Ste...
Checkpointing is a widely used mechanism for supporting fault tolerance, but notorious in its high-cost disk access. The idea of memory-based checkpointing has been extensively stu...
—This paper deals with decentralized, QoS-aware middleware for checkpointing arrangement in Mobile Grid (MoG) computing systems. Checkpointing is more crucial in MoG systems than...
—Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case of failures by preserving a consistent global checkpoint on stable storage. However, ...
The simple checkpointing and migration system for UNIX processes as described in the article of Bozyigit and Wasiq [1] can be improved in two ways: First by a technique to checkpo...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This notion has also been extended to message-pas...
Rajanikanth Batchu, Yoginder S. Dandass, Anthony S...
If the variables used for a checkpointing algorithm have data faults, the existing checkpointing and recovery algorithms may fail. In this paper, self-stabilizing data fault detec...
Several schemes for checkpointing and rollback recovery have been reported in the literature. In this paper, we analyze some of these schemes under a stochastic model. We have der...