Sciweavers

» On Coordinated Checkpointing in Distributed Systems
IPPS
2006
IEEE
14 years 2 months ago
Recent advances in checkpoint/recovery systems
Checkpoint and Recovery (CPR) systems have many uses in high-performance computing. Because of this, many developers have implemented it, by hand, into their applications. One of ...
Greg Bronevetsky, Rohit Fernandes, Daniel Marques,...
HPDC
2007
IEEE
14 years 3 months ago
Peer-to-peer checkpointing arrangement for mobile grid computing systems
This paper deals with a novel, distributed, QoS-aware, peer-to-peer checkpointing arrangement component for mobile Grid (MoG) computing systems middleware. Checkpointing is more cr...
Paul J. Darby III, Nian-Feng Tzeng
ICDCS
2012
IEEE
11 years 11 months ago
Combining Partial Redundancy and Checkpointing for HPC
Today’s largest High Performance Computing (HPC) systems exceed one Petaflops (10^15 floating point operations per second) and exascale systems are projected within seven years...
James Elliott, Kishor Kharbas, David Fiala, Frank ...
CLOUDCOM
2010
Springer
13 years 6 months ago
REMEM: REmote MEMory as Checkpointing Storage
Checkpointing is a widely used mechanism for supporting fault tolerance, but notorious in its high-cost disk access. The idea of memory-based checkpointing has been extensively stu...
Hui Jin, Xian-He Sun, Yong Chen, Tao Ke
CLUSTER
2004
IEEE
13 years 8 months ago
MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware
Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This notion has also been extended to message-pas...
Rajanikanth Batchu, Yoginder S. Dandass, Anthony S...