As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include ...
This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be efficiently extended to tolerate single-node failures. In particular, we extend a ...
We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At...
Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, ...
This paper describes a checkpointing mechanism intended for Distributed Shared Memory (DSM) systems with speculative prefetching. Speculation is a general technique involving predi...
Arkadiusz Danilecki, Anna Kobusinska, Michal Szych...
Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require unnecessarily high checkpointing frequency and checkpoint traffic, which are sens...