Recovery systems must save state before a failure occurs to enable the system to recover from the failure. However, recovery will fail if the recovery system saves any state corrupted by the fault. The frequency and comprehensiveness of how a recovery system saves state has a major effect on how often the recovery system inadvertently saves corrupted state. This paper explores and measures that effect. We measure how often software faults in the application and operating system cause real applications to save corrupted state when using different types of recovery systems. We find that generic recovery techniques, such as checkpointing and logging, work well for faults in the operating system. However, we find that they do not work well for faults in the application because the very actions taken to enable recovery often corrupt the state upon which successful recovery depends.
Subhachandra Chandra, Peter M. Chen