The probability that a failure will occur before the end of the computation increases as the number of processors used in a high performance computing application increases. For l...
Numerous mathematical approaches have been proposed to determine the optimal checkpoint interval for minimizing total execution time of an application in the presence of failures....
Checkpoint and Recovery (CPR) systems have many uses in high-performance computing. Because of this, many developers have implemented it, by hand, into their applications. One of ...
Greg Bronevetsky, Rohit Fernandes, Daniel Marques,...
As computational clusters increase in size, their mean-time-to-failure reduces. Typically checkpointing is used to minimize the loss of computation. Most checkpointing techniques, ...
Checkpointing and rollback recovery is a very effective technique to tolerate transient faults and preventive shutdowns. In the past, most of the checkpointing schemes published i...