Numerous mathematical approaches have been proposed to determine the optimal checkpoint interval for minimizing total execution time of an application in the presence of failures....
Over the past decade the number of processors in the high performance facilities went up to hundreds of thousands. As a direct consequence, while the computational power follow th...
Aurelien Bouteiller, George Bosilca, Jack Dongarra
This paper addresses the recovery and the rollback problem in distributed collaborative transactions. We propose a solution to the problem in a generalized ARIES [9] framework. We...
We present in this paper an extension of the messagedriven confidence-driven framework that we developed for onboard guarded software upgrading. The purpose of this work is to pr...