Sciweavers

342 search results - page 12 / 69
» A planning based approach to failure recovery in distributed...
SRDS
1994
IEEE
13 years 11 months ago
Coordinated Checkpointing-Rollback Error Recovery for Distributed Shared Memory Multicomputers
Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require unnecessarily high checkpointing frequency and checkpoint traffic, which are sens...
G. Janakiraman, Yuval Tamir
SRDS
2008
IEEE
14 years 1 months ago
Dynamically Quantifying and Improving the Reliability of Distributed Storage Systems
In this paper, we argue that the reliability of large-scale storage systems can be significantly improved by using better reliability metrics and more efficient policies for rec...
Rekha Bachwani, Leszek Gryz, Ricardo Bianchini, Ce...
GECCO
2005
Springer
14 years 1 months ago
A Pareto archive evolutionary strategy based radial basis function neural network training algorithm for failure rate prediction
This paper outlines a radial basis function neural network approach to predict the failures in overhead distribution lines of power delivery systems. The RBF networks are trained ...
Grant Cochenour, Jerad Simon, Sanjoy Das, Anil Pah...
NSDI
2004
13 years 9 months ago
Path-Based Failure and Evolution Management
We present a new approach to managing failures and evolution in large, complex distributed systems using runtime paths. We use the paths that requests follow as they move through the syst...
Mike Y. Chen, Anthony Accardi, Emre Kiciman, David...
PVM
2010
Springer
13 years 6 months ago
Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols
With the number of computing elements spiraling to hundreds of thousands in modern HPC systems, failures are common events. Few applications are nevertheless fault toleran...
George Bosilca, Aurelien Bouteiller, Thomas Hé...