The efficiency of service discovery is critical in the development of fully decentralized middleware intended to manage large scale computational grids. This demand influenced t...
Future processors are expected to observe increasing rates of hardware faults. Using Dual-Modular Redundancy (DMR), two cores of a multicore can be loosely coupled to redundantly ...
Philip M. Wells, Koushik Chakraborty, Gurindar S. ...
The productivity of HPC system is determined not only by their performance, but also by their reliability. The conventional method to limit the impact of failures is checkpointing...
In this paper we present an approach to increase the fault tolerance in FlexRay networks by introducing backup nodes to replace defect ECUs (Electronic Control Units). In order to ...
This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be efficiently extended to tolerate single-node failures. In particular, we extend a ...