An important problem in the ®eld of distributed systems is that of detecting the termination of a distributed computation. Distributed termination detection (DTD) is a dicult p...
Current petascale systems have tens of thousands of hardware components and complex system software stacks, which increase the probability of faults occurring during the lifetime ...
The Amoeba group communication system has two unique aspects: (1) it uses a sequencer-based protocol with negative acknowledgements for achieving a total order on all group messag...
Long running applications often need to adapt due to changing requirements or changing environment. Typically, such adaptation is performed by dynamically adding or removing compo...
Fault tolerance is now a primary design constraint for all major microprocessors. One step in determining a processor’s compliance to its failure rate target is measuring the Ar...