Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require unnecessarily high checkpointing frequency and checkpoint traffic, which are sens...
In this paper, we argue that the reliability of large-scale storage systems can be significantly improved by using better reliability metrics and more efficient policies for rec...
This paper outlines a radial basis function neural network approach to predict the failures in overhead distribution lines of power delivery systems. The RBF networks are trained ...
Grant Cochenour, Jerad Simon, Sanjoy Das, Anil Pah...
We present a new approach to managing failures and evolution in large, complex distributed systems using runtime paths. We use the paths that requests follow as they move through the syst...
Mike Y. Chen, Anthony Accardi, Emre Kiciman, David...
With the number of computing elements spiraling to hundreds of thousands in modern HPC systems, failures are common events. Few applications are nevertheless fault toleran...
George Bosilca, Aurelien Bouteiller, Thomas Hé...