Sciweavers

» On Coordinated Checkpointing in Distributed Systems
SC
2009
ACM
14 years 3 months ago
FALCON: a system for reliable checkpoint recovery in shared grid environments
In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as the performance degradation is tolerable. For gu...
Tanzima Zerin Islam, Saurabh Bagchi, Rudolf Eigenm...
SIGMETRICS
2011
ACM
12 years 11 months ago
Record and transplay: partial checkpointing for replay debugging across heterogeneous systems
Software bugs that occur in production are often difficult to reproduce in the lab due to subtle differences in the application environment and nondeterminism. To address this pr...
Dinesh Subhraveti, Jason Nieh
PODC
1998
ACM
14 years 1 month ago
Persistent Messages in Local Transactions
We present a new model for handling messages and state in a distributed application that we call Messages in Local Transactions (MLT). Under this model, messages and data are not...
David E. Lowell, Peter M. Chen
HCW
2000
IEEE
14 years 1 month ago
Reliable Cluster Computing with a New Checkpointing RAID-x Architecture
In a serverless cluster of PCs or workstations, the cluster must allow remote file accesses or parallel I/O directly performed over disks distributed to all client nodes. We intro...
Kai Hwang, Hai Jin, Roy S. C. Ho, Wonwoo Ro
IPPS
2005
IEEE
14 years 2 months ago
Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance
Checkpoint/restart is a general idea for which particular implementations enable various functionalities in computer systems, including process migration, gang scheduling, hiberna...
José Carlos Sancho, Fabrizio Petrini, Kei D...