As the number of processors in today’s high performance computers continues to grow, the mean-time-to-failure of these computers are becoming significantly shorter than the exe...
Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julie...
Group communication protocols are used in fault-tolerant systems to maintain strong replica consistency. The FaultTolerant Multicast Protocol (FTMP) described here is a group comm...
Louise E. Moser, P. M. Melliar-Smith, Ruppert R. K...
The reliability of future processors is threatened by decreasing transistor robustness. Current architectures focus on delivering high performance at low cost; lifetime device rel...
Andrea Pellegrini, Joseph L. Greathouse, Valeria B...
Transient faults due to particle strikes are a key challenge in microprocessor design. Driven by exponentially increasing transistor counts, per-chip faults are a growing burden. ...
Kristen R. Walcott, Greg Humphreys, Sudhanva Gurum...
Fault tolerant distributed protocols typically utilize a homogeneous fault model, either fail-crash or fail-Byzantine, where all processors are assumed to fail in the same manner....