This paper describes the design and implementation of SecondSite, a cloud-based service for disaster tolerance. SecondSite extends the Remus virtualization-based high availability...
Shriram Rajagopalan, Brendan Cully, Ryan O'Connor,...
Three protocols for gossip-based failure detection services in large-scale heterogeneous clusters are analyzed and compared. The basic gossip protocol provides a means by which fai...
stractions underlying distributed computing. We attempted to keep our claims at an abstract and general level. In this column, we make those claims more concrete. More precisely, ...
Failure detection is a difficult and often expensive task. The principle of self-healing addresses this cost issue, but poses new research questions. This work focuses on detectin...
The growing interest in ad hoc wireless network applications that are made of large and dense populations of lightweight system resources calls for scalable approaches to fault to...
The rapid growth of the Web has made it possible to build collaborative applications on an unprecedented scale. However, the request-reply interaction model of HTTP limits the rang...
Fault-tolerant distributed systems based on fieldbuses may benefit to a great extent from the availability of semantically rich communication services, such as those provided by g...
The increasing complexity of today’s systems makes fast and accurate failure detection essential for their use in mission-critical applications. Various monitoring methods provi...
We present a modular redesign of TrustedPals, a smartcard-based security framework for solving secure multiparty computation (SMC). TrustedPals allows SMC to be reduced to the probl...
This paper presents recent developments in fault management for P2P-MPI, a grid middleware, covering fault-tolerance for applications and fault detect...