Sciweavers

ISCA
2003
IEEE

Transient-Fault Recovery for Chip Multiprocessors

To address the increasing susceptibility of commodity chip multiprocessors (CMPs) to transient faults, we propose Chip-level Redundantly Threaded multiprocessor with Recovery (CRTR). CRTR extends the previously proposed CRT for transient-fault detection in CMPs, and the previously proposed SRTR for transient-fault recovery in SMT. All these schemes achieve fault tolerance by executing and comparing two copies, called leading and trailing threads, of a given application. Previous recovery schemes for SMT do not perform well on CMPs. In a CMP, the leading and trailing threads execute on different processors to achieve load balancing and reduce the probability of a fault corrupting both threads, whereas in an SMT, both threads execute on the same processor. The inter-processor communication required to compare the threads introduces latency and bandwidth problems not present in an SMT. To hide inter-processor latency, CRTR executes the leading thread ahead of the trailing thread by mainta...
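The general idea of running a leading copy ahead of a trailing copy and comparing their results can be illustrated in software. The following is only a toy sketch of that idea, not the hardware mechanism of CRTR, CRT, or SRTR: the names `execute` and `run_redundant`, the `slack` parameter, and the single-bit-flip fault model are all assumptions made for illustration, and the sketch covers detection by comparison only, not recovery.

```python
from collections import deque


def execute(op, a, b, flip_bit=None):
    """Toy ALU. Optionally flips one result bit to model a transient fault."""
    result = a + b if op == "add" else a * b
    if flip_bit is not None:
        result ^= 1 << flip_bit
    return result


def run_redundant(program, slack=2, fault_at=None):
    """Return the index of the first mismatching instruction, or None.

    The leading copy may run up to `slack` instructions ahead (hypothetical
    parameter); the trailing copy re-executes each instruction and compares.
    """
    pending = deque()  # leading-copy results awaiting comparison

    def check_oldest():
        index, op, a, b, leading_result = pending.popleft()
        trailing_result = execute(op, a, b)  # trailing copy is fault-free here
        return None if trailing_result == leading_result else index

    for index, (op, a, b) in enumerate(program):
        # Inject a fault into the leading copy at one chosen instruction.
        flip_bit = 0 if index == fault_at else None
        pending.append((index, op, a, b, execute(op, a, b, flip_bit)))
        if len(pending) > slack:
            mismatch = check_oldest()
            if mismatch is not None:
                return mismatch

    # Drain the comparisons still outstanding once the leading copy finishes.
    while pending:
        mismatch = check_oldest()
        if mismatch is not None:
            return mismatch
    return None
```

Calling `run_redundant(program)` on a fault-free run reports no mismatch, while passing `fault_at=i` makes the comparison flag instruction `i`. The queue between the two copies stands in for the communication path whose latency the abstract describes; a larger `slack` lets the leading copy run further ahead before a comparison is required.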
Added 04 Jul 2010
Updated 04 Jul 2010
Type Conference
Year 2003
Where ISCA
Authors Mohamed A. Gomaa, Chad Scarbrough, Irith Pomeranz, T. N. Vijaykumar