Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This notion has also been extended to message-pas...
Rajanikanth Batchu, Yoginder S. Dandass, Anthony S...
Fault tolerance is one of the key issues for large scale applications executed on high performance computing systems. In a cluster federation, clusters are gathered to provide hug...
ing Abstraction to Improve Fault Tolerance MIGUEL CASTRO Microsoft Research and RODRIGO RODRIGUES and BARBARA LISKOV MIT Laboratory for Computer Science Software errors are a major...
MicroQoSCORBA (MQC) is a middleware platform that focuses on embedded applications by providing a very fine level of configurability of its internal orthogonal components. Using t...
This paper reports on the architecture and design of Starfish, an environment for executing dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in ...