Rebound: scalable checkpointing for coherent shared memory

As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors. To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distr...
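The idea of building checkpoint and rollback operations around dynamic groups of communicating processors can be illustrated with a small software model. The sketch below is not Rebound's hardware design: the class `DependenceTracker`, its methods, and the rule that only a read of a line last written by another processor creates a dependence are all simplifying assumptions made for illustration. It records producer-to-consumer dependences and computes, by transitive closure, which processors would have to roll back or checkpoint together.

```python
from collections import defaultdict, deque


class DependenceTracker:
    """Toy model (hypothetical): producer -> consumer dependences between processors."""

    def __init__(self):
        self.last_writer = {}               # line address -> id of last writing processor
        self.consumers = defaultdict(set)   # producer id -> processors that read its data
        self.producers = defaultdict(set)   # consumer id -> processors whose data it read

    def write(self, proc, addr):
        # Assumption: a write only updates ownership; it creates no dependence here.
        self.last_writer[addr] = proc

    def read(self, proc, addr):
        # Assumption: reading a line last written by another processor
        # makes the reader depend on that writer.
        producer = self.last_writer.get(addr)
        if producer is not None and producer != proc:
            self.consumers[producer].add(proc)
            self.producers[proc].add(producer)

    @staticmethod
    def _closure(start, edges):
        # Breadth-first transitive closure over the given edge map.
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in edges.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def rollback_group(self, proc):
        # If proc rolls back, everything that consumed its data (transitively) must too.
        return self._closure(proc, self.consumers)

    def checkpoint_group(self, proc):
        # If proc checkpoints, the producers it depends on (transitively) checkpoint too.
        return self._closure(proc, self.producers)
```

For example, if processor 0 writes a line that processor 1 reads, and processor 1 writes a line that processor 2 reads, then a rollback of processor 0 involves processors 0, 1, and 2, while an unrelated processor 3 is left alone. That locality is what distinguishes this style of checkpointing from a global scheme.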

Rishi Agarwal, Pranav Garg, Josep Torrellas

Cache Coherence | Directory Protocol | Hardware | ISCA 2011 | Memory Multiprocessors |

Post Info

Added	21 Aug 2011
Updated	21 Aug 2011
Type	Journal
Year	2011
Where	ISCA
Authors	Rishi Agarwal, Pranav Garg, Josep Torrellas
