Sciweavers

» Automatic recovery from software failure
SC
2009
ACM
14 years 2 months ago
FALCON: a system for reliable checkpoint recovery in shared grid environments
In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as the performance degradation is tolerable. For gu...
Tanzima Zerin Islam, Saurabh Bagchi, Rudolf Eigenm...
SERP
2003
13 years 9 months ago
Performance of Service-Discovery Architectures in Response to Node Failures
Current trends suggest future software systems will rely on service-discovery protocols to combine and recombine distributed services dynamically in reaction to changing condition...
Christopher Dabrowski, Kevin L. Mills, Andrew L. R...
SOSP
2003
ACM
14 years 4 months ago
Improving the reliability of commodity operating systems
Despite decades of research in extensible operating system technology, extensions such as device drivers remain a significant cause of system failures. In Windows XP, for example,...
Michael M. Swift, Brian N. Bershad, Henry M. Levy
IPPS
2007
IEEE
14 years 1 month ago
DejaVu: Transparent User-Level Checkpointing, Migration, and Recovery for Distributed Systems
In this paper, we present a new fault tolerance system called DejaVu for transparent and automatic checkpointing, migration, and recovery of parallel and distributed applications....
Joseph F. Ruscio, Michael A. Heffner, Srinidhi Var...
DSN
2002
IEEE
14 years 15 days ago
Reducing Recovery Time in a Small Recursively Restartable System
We present ideas on how to structure software systems for high availability by considering MTTR/MTTF characteristics of components in addition to the traditional criteria, such as...
George Candea, James Cutler, Armando Fox, Rushabh ...