DSN 2005, IEEE

Design Time Reliability Analysis of Distributed Fault Tolerance Algorithms

Designing a distributed fault tolerance algorithm requires careful analysis of both fault models and diagnosis strategies. A system will fail if there are too many active faults, especially active Byzantine faults. However, a system will also fail if overly aggressive convictions leave inadequate redundancy. For high reliability, an algorithm’s hybrid fault model and diagnosis strategy must be tuned to the types and rates of faults expected in the real world. We examine this balancing problem for two common types of distributed algorithms: clock synchronization and group membership. By considering two clock synchronization algorithms, we show the importance of choosing a hybrid fault model appropriate for the expected physical faults. Three diagnosis strategies for a group membership service are used to demonstrate the benefit of discriminating between permanent and transient faults. In most cases, the probability of failure is dominated by one fault type. By identifying the dominant cause of...
Elizabeth Latronico, Philip Koopman
Added: 24 Jun 2010
Updated: 24 Jun 2010
Type: Conference
Year: 2005
Where: DSN
Authors: Elizabeth Latronico, Philip Koopman