Sciweavers

668 search results - page 6 / 134
» Implementing and Evaluating Automatic Checkpointing
SSS
2010
Springer
13 years 5 months ago
Lightweight Live Migration for High Availability Cluster Service
High availability is a critical feature for service clusters and cloud computing, and is often considered more valuable than performance. One commonly used technique to enhance the...
Bo Jiang, Binoy Ravindran, Changsoo Kim
EDCC
2005
Springer
14 years 7 days ago
Performance Evaluation of Consistent Recovery Protocols Using MPICH-GF
This paper presents an implementation of several consistent protocols at the abstract device level and their performance comparison. We have performed experiments using three NAS P...
Namyoon Woo, Hyungsoo Jung, Dongin Shin, Hyuck Han...
GI
2004
Springer
14 years 2 days ago
Crash Management for Distributed Parallel Systems
With the growing complexity of parallel architectures, the probability of system failures grows, too. One approach to cope with this problem is the self-healing, one of the organ...
Jan Haase, Frank Eschmann
CLUSTER
2007
IEEE
14 years 1 months ago
Evaluation of fault-tolerant policies using simulation
Various mechanisms for fault-tolerance (FT) are used today in order to reduce the impact of failures on application execution. In the case of system failure, standard FT mechan...
Anand Tikotekar, Geoffroy Vallée, Thomas Na...
IPPS
1999
IEEE
13 years 11 months ago
An Efficient Logging Algorithm for Incremental Replay of Message
To support incremental replay of message-passing applications, processes must periodically checkpoint and the content of some messages must be logged, to break dependencies of the...
Franco Zambonelli