Clusters and distributed systems offer fault tolerance and high performance through load sharing. When all n computers are up and running, we would like the load to be evenly distr...
As the desire of scientists to perform ever larger computations drives the size of today’s high performance computers from hundreds, to thousands, and even tens of thousands of ...
In this paper, we use random-selection protocols in the full-information model to solve classical problems in distributed computing. Our main results are the following: • An O(l...
Shafi Goldwasser, Elan Pavlov, Vinod Vaikuntanatha...
In this paper, we describe the design and implementation of two mechanisms for fault-tolerance and recovery for complex scientific workflows on computational grids. We present our ...
- Triple Modular Redundancy is widely used in dependable systems design to ensure high reliability against soft errors. Conventional TMR is effective in protecting sequential circu...
Wei Chen, Rui Gong, Fang Liu, Kui Dai, Zhiying Wan...