This paper describes a new method for providingtransparent fault tolerance for parallel applications on a network of workstations. We have designed our method in the context of sh...
The probability that a failure will occur before the end of the computation increases as the number of processors used in a high performance computing application increases. For l...
Constructing logical machines out of collections of physical machines is a well-known technique for improving the robustness and fault tolerance of distributed systems. We present...
Yair Amir, Brian A. Coan, Jonathan Kirsch, John La...
Commodity computer clusters are often composed of hundreds of computing nodes. These generally off-the-shelf systems are not designed for high reliability. Node failures therefore...
Mesh networks are among the most important interconnection network topologies for large multicomputer systems. Mesh networks perform poorly in tolerating faults in the view of wor...
Jianer Chen, Gaocai Wang, Chuang Lin, Tao Wang, Gu...