Reducing Recovery Time in a Small Recursively Restartable System

We present ideas on how to structure software systems for high availability by considering MTTR/MTTF characteristics of components in addition to the traditional criteria, such as functionality or state sharing. Recursive restartability (RR), a recently proposed technique for achieving high availability, exploits partial restarts at various levels within complex software infrastructures to recover from transient failures and rejuvenate software components. Here we refine the original proposal and apply the RR philosophy to Mercury, a COTS-based satellite ground station that has been in operation for over 2 years. We develop three techniques for transforming component group boundaries such that time-to-recover is reduced, hence increasing system availability. We also further RR by defining the notions of an oracle, restart group and restart policy, while showing how to reason about system properties in terms of restart groups. From our experience with applying RR to Mercury, we draw ...

George Candea, James Cutler, Armando Fox, Rushabh Doshi, Priyank Garg, Rakesh Gowda

Computer Networks | DSN 2002 | Recursive Restartability | Restart Groups | Software Systems |

Post Info

Added	14 Jul 2010
Updated	14 Jul 2010
Type	Conference
Year	2002
Where	DSN
Authors	George Candea, James Cutler, Armando Fox, Rushabh Doshi, Priyank Garg, Rakesh Gowda
