The shrinking processor feature size, lower threshold voltage and increasing clock frequency make modern processors highly vulnerable to transient faults. Architectural Vulnerabil...
Concentration of design effort for current single-chip Commercial-Off-The-Shelf (COTS) microprocessors has been directed towards performance. Reliability has not been the primary ...
Due to the ever-widening performance gap between processors and disks, I/O operations tend to become the major performance bottleneck of data-intensive applications on modern clus...
Yifeng Zhu, Hong Jiang, Xiao Qin, Dan Feng, David ...
We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At...
Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, ...
With the increasing number of processors in modern HPC(High Performance Computing) systems, there are two emergent problems to solve. One is scalability, the other is fault tolera...