The probability that a failure will occur before the end of the computation increases as the number of processors used in a high performance computing application increases. For l...
Distributed information systems are critical to the functioning of many businesses; designing them to be dependable is a challenging but important task. We report our experience i...
Jeremy Bryans, John S. Fitzgerald, Alexander Roman...
We study the completion time of broadcast operations on static ad hoc wireless networks in the presence of unpredictable and dynamical faults. Concerning oblivious fault-tolerant dist...
Andrea E. F. Clementi, Angelo Monti, Riccardo Silv...
We initiate an investigation of general fault-tolerant distributed computation in the full-information model. In the full information model no restrictions are made on the computat...
Fault tolerance in MPI becomes a main issue in the HPC community. Several approaches are envisioned from user or programmer controlled fault tolerance to fully automatic fault ...
Aurelien Bouteiller, Boris Collin, Thomas Hé...