Cooperative checkpointing uses global knowledge of the state and health of the machine to improve performance and reliability by dynamically deciding when to skip checkpoint reque...
If the variables used for a checkpointing algorithm have data faults, the existing checkpointing and recovery algorithms may fail. In this paper, self-stabilizing data fault detec...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety ...
Greg Bronevetsky, Daniel Marques, Keshav Pingali, ...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety o...
Greg Bronevetsky, Daniel Marques, Keshav Pingali, ...
We introduce transactors, a fault-tolerant programming model for composing loosely-coupled distributed components running in an unreliable environment such as the internet into sy...