Fault tolerance is a major concern in guaranteeing the availability of critical services as well as application execution. Traditional approaches for fault tolerance include checkpoint/r...
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance techniq...
George Bosilca, Remi Delmas, Jack Dongarra, Julien...
Replication is widely used to improve fault tolerance in distributed and multi-agent systems. In this paper, we present a different point of view on replication in multi-agent syst...
Continuously shrinking feature sizes cause an increasing vulnerability of digital circuits. Manufacturing failures and transient faults may tamper with the functionality. Aut...
Web services have been identified as a suitable technology for the development and execution of distributed applications. However, the Web service architecture still lacks facilities...