The development of dependable software systems is a costly undertaking. Fault tolerance techniques as well as self-repair capabilities usually result in additional system complexi...
Tulip is an overlay for routing, searching and publish-lookup information sharing. It offers a unique combination of the advantages of both structured and unstructured overlays, t...
Ittai Abraham, Ankur Badola, Danny Bickson, Dahlia...
Intrusion tolerance is a recent approach to deal with intentional and malicious failures. It combines the research on fault tolerance with the research on security, and relies on...
This paper describes the methodology used to add nonintrusive system-level fault tolerance to an electronic throttle controller. The original model of the throttle controller is a...
We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented ...
This paper describes the use of fault tolerance in a multiagent system. The approach is based on modeling autonomous agents with planning capabilities. These capabiliti...
In this paper, we propose the coordinated robust routing (CRR) scheme to address the fault tolerance requirements of layered wireless sensor networks. In the proposed scheme, ...
Mei Yang, Jianping Wang, Zhen-guo Gao, Yingtao Jia...
Checkpoint/restart is a general idea for which particular implementations enable various functionalities in computer systems, including process migration, gang scheduling, hiberna...
We observe increasing interest in aggregating geographically distributed, heterogeneous resources to perform large-scale computations. MPI remains the most popular programming par...
Fault tolerance in MPI has become a major issue in the HPC community. Several approaches are envisioned, ranging from user- or programmer-controlled fault tolerance to fully automatic fault ...
Aurelien Bouteiller, Boris Collin, Thomas Hé...