An important step in achieving robustness to run-time faults is the ability to detect and repair problems when they arise in a running system. Effective fault detection a...
Paulo Casanova, Bradley R. Schmerl, David Garlan, ...
Software testing and software fault tolerance are two major techniques for developing reliable software systems, yet limited empirical data are available in the literature to eval...
Michael R. Lyu, Zubin Huang, Sam K. S. Sze, Xia Ca...
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance techniq...
George Bosilca, Remi Delmas, Jack Dongarra, Julien...
The proposed software technique is a very low-cost and effective solution for designing Byzantine fault-tolerant computing application systems that are not so safety critic...
This paper presents a distributed version of our previous work, called SAFDetection, which is a sensor analysis-based fault detection approach that is used to monitor tightly-cou...