This paper proposes a lightweight checkpointing scheme for real-time embedded systems. The goal is to separate concerns by allowing applications to take checkpoints independently ...
High availability is a critical feature for service clusters and cloud computing, and is often considered more valuable than performance. One commonly used technique to enhance the...
As the core count in high-performance computing systems keeps increasing, faults are becoming common place. Checkpointing addresses such faults but captures full process images ev...
Chao Wang, Frank Mueller, Christian Engelmann, Ste...
Transient faults that arise in large-scale software systems can often be repaired by re-executing the code in which they occur. Ascribing a meaningful semantics for safe re-execut...
This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be efficiently extended to tolerate single-node failures. In particular, we extend a ...