Sampling + DMR: practical and low-overhead permanent fault detection


Download www.cs.wisc.edu

With technology scaling, manufacture-time and in-field permanent faults are becoming a fundamental problem. Multi-core architectures with spares can tolerate them by detecting and isolating faulty cores, but the required fault detection coverage becomes effectively 100% as the number of permanent faults increases. Dual-modular redundancy (DMR) can provide 100% coverage without assuming device-level fault models, but its overhead is excessive. In this paper, we explore a simple and low-overhead mechanism we call Sampling-DMR: run in DMR mode for a small percentage (1% of the time for example) of each periodic execution window (5 million cycles for example). Although Sampling-DMR can leave some errors undetected, we argue the permanent fault coverage is 100% because it can detect all faults eventually. Sampling-DMR thus introduces a system paradigm of restricting all permanent faults’ effects to small finite windows of error occurrence. We prove an ultimate upper bound exists on total...
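The sampling schedule described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, and placing the DMR phase at the start of each window is an assumption (the abstract does not say where in the window it falls). The two constants are the example values quoted in the abstract.

```python
WINDOW_CYCLES = 5_000_000   # example window length quoted in the abstract
DMR_FRACTION = 0.01         # example sampling fraction quoted in the abstract


def dmr_cycles_per_window(window_cycles=WINDOW_CYCLES, dmr_fraction=DMR_FRACTION):
    """Number of cycles per window that run in DMR mode."""
    return round(window_cycles * dmr_fraction)


def in_dmr_mode(cycle, window_cycles=WINDOW_CYCLES, dmr_fraction=DMR_FRACTION):
    """Assumption: the DMR phase occupies the first cycles of every window."""
    offset = cycle % window_cycles
    return offset < dmr_cycles_per_window(window_cycles, dmr_fraction)


def first_detection_cycle(error_cycles, window_cycles=WINDOW_CYCLES,
                          dmr_fraction=DMR_FRACTION):
    """Return the first error cycle that coincides with a DMR phase, or None.

    Errors outside a DMR phase go unnoticed; a permanent fault keeps
    producing errors, so a later one can still land in a DMR phase.
    """
    for cycle in sorted(error_cycles):
        if in_dmr_mode(cycle, window_cycles, dmr_fraction):
            return cycle
    return None
```

With the example values, 50,000 of every 5,000,000 cycles run redundantly. Calling `first_detection_cycle([100_000, 4_000_000, 5_000_010])` skips the two errors that occur outside a DMR phase and reports the third, which falls in the DMR phase of the next window.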

Shuou Nomura, Matthew D. Sinclair, Chen-Han Ho, Venkatraman Govindaraju, Marc de Kruijf, Karthikeyan Sankaralingam


Computer Systems Organization | Hardware | ISCA 2011 | Periodic Execution | Tolerance C |


Post Info

Added	21 Aug 2011
Updated	21 Aug 2011
Type	Conference
Year	2011
Where	ISCA
Authors	Shuou Nomura, Matthew D. Sinclair, Chen-Han Ho, Venkatraman Govindaraju, Marc de Kruijf, Karthikeyan Sankaralingam
