There has recently been increasing interest in using system virtualization to improve the dependability of HPC cluster systems. However, it is not cost-free and may come with som...
Haibo Chen, Rong Chen, Fengzhe Zhang, Binyu Zang, ...
The productivity of HPC systems is determined not only by their performance, but also by their reliability. The conventional method to limit the impact of failures is checkpointing...
PC clusters with a high performance/cost ratio have been one of the typical platforms for high-performance computing. To lower costs, Gigabit Ethernet is often used for intercommuni...
Information Services are fundamental blocks of the Grid infrastructure. They are responsible for collecting and distributing information about resource availability and s...
Diego Puppin, Stefano Moncelli, Ranieri Baraglia, ...
This paper deals with tolerance to timing faults in time-constrained systems. TAFT (Time Aware Fault-Tolerant) is a recently devised approach which applies tolerance to timing vio...
F. Sandrini, Felicita Di Giandomenico, Andrea Bond...