Abstract-- High performance computing platforms like Clusters, Grid and Desktop Grids are becoming larger and subject to more frequent failures. MPI is one of the most used message...
Service-Oriented Architecture (SOA) is widely adopted for building mission-critical systems, ranging from on-line stores to complex airline management systems. How to build reliabl...
: We present a new approach to fault tolerance for High Performance Computing system. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance techniq...
George Bosilca, Remi Delmas, Jack Dongarra, Julien...
In order to achieve fault tolerance, highly reliable system often require the ability to detect errors as soon as they occur and prevent the speared of erroneous information throu...
Fault tolerance is an important aspect in real-time computing. In real-time control systems, tasks could be faulty due to various reasons. Faulty tasks may compromise the performa...
Xue Liu, Hui Ding, Kihwal Lee, Qixin Wang, Lui Sha
Abstract—Continuously shrinking feature sizes cause an increasing vulnerability of digital circuits. Manufacturing failures and transient faults may tamper the functionality. Aut...
This paper describes a new method for providingtransparent fault tolerance for parallel applications on a network of workstations. We have designed our method in the context of sh...
This paper deals with multiprocessor systems required to provide both high performance and good figures of dependability attributes. Fault tolerance is pursued through a proper co...
Felicita Di Giandomenico, Silvano Chiaradonna, And...
— With the increasing number of embedded computer systems being used in safety critical applications the testing and assessment of a system’s fault tolerance properties become ...
This paper describes a method of supervised learning based on forward selection branching. This method improves fault tolerance by means of combining information related to general...