Cooperative checkpointing theory

15 years 7 months ago


Cooperative checkpointing uses global knowledge of the state and health of the machine to improve performance and reliability by dynamically deciding when to skip checkpoint requests made by applications. Using results from cooperative checkpointing theory, this paper proves that periodic checkpointing is not expected to be competitive with the offline optimal. By leveraging probabilistic information about the future, cooperative checkpointing gives flexible algorithms that are optimally competitive. The results prove that simulating periodic checkpointing, by performing only every dth checkpoint, is not competitive with the offline optimal in the worst case; a simple modification gives a provably competitive algorithm. Calculations using failure traces from a prototype of IBM’s Blue Gene/L show an application using cooperative checkpointing may make progress 4 times faster than one using periodic checkpointing, under realistic conditions. We contribute an approach to providing la...
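As a rough illustration of the policy the abstract calls "performing only every dth checkpoint", the sketch below shows a gatekeeper that receives checkpoint requests and grants only every d-th one. The names `EveryDthGatekeeper`, `should_checkpoint`, and `simulate` are hypothetical and not taken from the paper, and the sketch does not attempt the competitive modification the abstract mentions, since the abstract does not describe it.

```python
from dataclasses import dataclass


@dataclass
class EveryDthGatekeeper:
    """Hypothetical gatekeeper that grants only every d-th checkpoint request.

    Assumption: d is a positive integer chosen in advance.
    """
    d: int
    requests_seen: int = 0

    def should_checkpoint(self) -> bool:
        # Count the request, then grant it only when the count is a multiple of d.
        self.requests_seen += 1
        return self.requests_seen % self.d == 0


def simulate(num_requests: int, d: int) -> list[int]:
    """Return the 1-based indices of the requests that are actually performed."""
    gate = EveryDthGatekeeper(d)
    return [i for i in range(1, num_requests + 1) if gate.should_checkpoint()]


performed = simulate(num_requests=10, d=3)
print(performed)  # [3, 6, 9]
```

With `d = 1` every request is performed. Larger values of `d` skip more requests, which reduces checkpoint overhead but leaves more work exposed to a failure between checkpoints.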

Adam J. Oliner, Larry Rudolph, Ramendra K. Sahoo
