A new approach is proposed that exploits repetition inherent in programs to provide low-overhead transient fault protection in a processor. Programs repeatedly execute the same in...
Experimental evaluation of fault-tolerant distributed systems is a complex and tedious task, and automating as much as possible of the execution and evaluation of exper...
CX, a network-based computational exchange, is presented. The system’s design integrates variations of ideas from other researchers, such as work stealing, non-blocking tasks, e...
In this paper we consider general simulations of algorithms designed for fully operational BSP and CGM machines on machines with faulty processors. The faults are deterministic (i...
A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault-tolerant progr...