Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This notion has also been extended to message-pas...
Rajanikanth Batchu, Yoginder S. Dandass, Anthony S...
The number of processors embedded in high-performance computing platforms is growing daily to solve larger and more complex problems. The logical network topologies must also suppo...
Fault tolerance is an important issue for large machines with tens or hundreds of thousands of processors. Checkpoint-based methods, currently used on most machines, roll back all ...
The idle computers on a local area, campus area, or even wide area network represent a significant computational resource--one that is, however, also unreliable, heterogeneous, an...
Commodity computer clusters are often composed of hundreds of computing nodes. These generally off-the-shelf systems are not designed for high reliability. Node failures therefore...