—Future computing systems will feature many cores that run fast, but might show more faults compared to existing CMOS technologies. New software methodologies must be adopted to utilize communication bandwidth and the computational power of few slow, reliable cores that could be employed in such systems to verify the results of the fast, faulty cores. Employing the traditional Triple Module Redundancy (TMR) at core instruction level would not be as effective due to its blind replication of computations. We propose two software development methods that utilize what we call Smart TMR (STMR) and fingerprinting to statistically monitor the results of computations and selectively replicate computations that exhibit faults. Experimental results show significant speedup and reliability improvement over traditional TMR approaches.
Hamid Safizadeh, Mohammad Tahghighi, Ehsan K. Arde