checkpointing | Sciweavers

HPDC 2011 (IEEE)

Algorithm-based recovery for iterative methods without checkpointing

In today’s high performance computing practice, fail-stop failures are often tolerated by checkpointing. While checkpointing is a very general technique and can often be applied...

Zizhong Chen

ICPPW 2009 (IEEE)

Analyzing Checkpointing Trends for Applications on the IBM Blue Gene/P System

Download www.mcs.anl.gov

Current petascale systems have tens of thousands of hardware components and complex system software stacks, which increase the probability of faults occurring during the lifetime ...

Harish Gapanati Naik, Rinku Gupta, Pete Beckman

ICPADS 2010 (IEEE)

Hybrid Checkpointing for MPI Jobs in HPC Environments

Download moss.csc.ncsu.edu

As the core count in high-performance computing systems keeps increasing, faults are becoming commonplace. Checkpointing addresses such faults but captures full process images ev...

Chao Wang, Frank Mueller, Christian Engelmann, Ste...

CLOUDCOM 2010 (Springer)

REMEM: REmote MEMory as Checkpointing Storage

Download ft.ornl.gov

Checkpointing is a widely used mechanism for supporting fault tolerance, but it is notorious for its costly disk access. The idea of memory-based checkpointing has been extensively stu...

Hui Jin, Xian-He Sun, Yong Chen, Tao Ke

TMC 2010

Decentralized QoS-Aware Checkpointing Arrangement in Mobile Grid Computing

Download people.cs.vt.edu

This paper deals with decentralized, QoS-aware middleware for checkpointing arrangement in Mobile Grid (MoG) computing systems. Checkpointing is more crucial in MoG systems than...

Paul J. Darby III, Nian-Feng Tzeng

TPDS 1998

On Coordinated Checkpointing in Distributed Systems

Download mcn.cse.psu.edu

Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case of failures by preserving a consistent global checkpoint on stable storage. However, ...

Guohong Cao, Mukesh Singhal
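The abstract above rests on the notion of a consistent global checkpoint. As a minimal sketch, assuming reliable FIFO channels and a hypothetical bookkeeping scheme in which each local checkpoint records only how many messages its process has sent to and received from each peer (this is an illustration of the consistency condition, not the authors' algorithm), a set of local checkpoints is consistent exactly when no checkpoint records receiving a message that the sender's checkpoint has not recorded sending (an orphan message):

```python
from dataclasses import dataclass, field


@dataclass
class LocalCheckpoint:
    """Hypothetical per-process checkpoint record holding message counters only."""
    sent: dict[int, int] = field(default_factory=dict)      # peer id -> messages sent to it
    received: dict[int, int] = field(default_factory=dict)  # peer id -> messages received from it


def is_consistent(checkpoints: dict[int, LocalCheckpoint]) -> bool:
    """Return True if no local checkpoint records an orphan message.

    Assumes reliable FIFO channels, so on each channel i -> j the receiver can
    only have consumed a prefix of what the sender sent. An orphan exists
    exactly when j's checkpoint records more messages from i than i's
    checkpoint records sending to j. Every sender id must appear in
    `checkpoints` (one checkpoint per process).
    """
    for j, cp_j in checkpoints.items():
        for i, got in cp_j.received.items():
            sent_by_i = checkpoints[i].sent.get(j, 0)
            if got > sent_by_i:
                return False
    return True


# P0 sent 2 messages to P1 before checkpointing; P1 had received only 1 of them.
ok = {0: LocalCheckpoint(sent={1: 2}), 1: LocalCheckpoint(received={0: 1})}
# P1's checkpoint records a message that P0's checkpoint never sent: an orphan.
bad = {0: LocalCheckpoint(sent={1: 1}), 1: LocalCheckpoint(received={0: 2})}
print(is_consistent(ok), is_consistent(bad))  # True False
```

In the first case the second message is merely in transit and must be handled separately (for example, by recording channel state); the second case is the kind of inconsistent state that a coordinated protocol is meant to rule out.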

SIGOPS 2002

Comments on "transparent user-level process checkpoint and restore for migration" by Bozyigit and Wasiq

Download www.cs.inf.ethz.ch

The simple checkpointing and migration system for UNIX processes as described in the article of Bozyigit and Wasiq [1] can be improved in two ways: First by a technique to checkpo...

Felix Rauch, Thomas Stricker

CLUSTER 2004 (IEEE)

MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware

Download www.cse.msstate.edu

Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This notion has also been extended to message-pas...

Rajanikanth Batchu, Yoginder S. Dandass, Anthony S...

JPDC 2007

Self-stabilizing algorithm for checkpointing in a distributed system

Download www.isical.ac.in

If the variables used for a checkpointing algorithm have data faults, the existing checkpointing and recovery algorithms may fail. In this paper, self-stabilizing data fault detec...

Partha Sarathi Mandal, Krishnendu Mukhopadhyaya

JPDC 2006

Performance analysis of different checkpointing and recovery schemes using stochastic model

Download www.isical.ac.in

Several schemes for checkpointing and rollback recovery have been reported in the literature. In this paper, we analyze some of these schemes under a stochastic model. We have der...

Partha Sarathi Mandal, Krishnendu Mukhopadhyaya
