Clock-related operations are one of the many sources of replica non-determinism and of replica inconsistency in fault-tolerant distributed systems. In passive replication, if the ...
Wenbing Zhao, Louise E. Moser, P. M. Melliar-Smith
Massively parallel computing systems are being built with thousands of nodes. Because of the high number of components, it is critical to keep these systems running even in the pre...
Designing a distributed fault tolerance algorithm requires careful analysis of both fault models and diagnosis strategies. A system will fail if there are too many active faults, ...
The current multiprocessors such asCray T3D support interprocessor communication using partitioned dimension-order routers (PDRs). In a PDR implementation, the routing logic and sw...
Fault-tolerant time-triggered communication relies on the synchronization of local clocks. The startup problem is the problem of reaching a sufficient degree of synchronization a...