Testing is a difficult process that becomes more difficult with scaling. With smaller and faster devices, tolerance for errors shrinks, and devices may act correctly under certain conditions and not under others. As such, hard errors may exist but be exercised only by very specific machine states and signal pathways. Targeting these errors is difficult, and creating test cases that cover all machine states and pathways is not possible. In addition, new complications during burn-in may mean that latent hard errors are not exposed in the fab and reach the customer before becoming active. To address this problem, we propose an architecture we call BlackJack that allows hard errors to be detected using redundant threads running on a single simultaneous multithreading (SMT) core. This technique provides a safety net that catches hard errors that were either latent during test or simply not covered by the test cases. Like SRT, our technique works by executing redundant copies and verifying that their resulting machine sta...
Ethan Schuchman, T. N. Vijaykumar