Search Sciweavers | Sciweavers

31 search results - page 3 / 7

CCGRID
2006
IEEE

Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation

14 years 3 months ago

Download icl.cs.utk.edu

With the increasing number of processors in modern HPC (High Performance Computing) systems, there are two emergent problems to solve. One is scalability, the other is fault tolera...

Yuan Tang, Graham E. Fagg, Jack Dongarra

ICDCS
2012
IEEE

Combining Partial Redundancy and Checkpointing for HPC

12 years 7 hour ago

Download moss.csc.ncsu.edu

Today’s largest High Performance Computing (HPC) systems exceed one Petaflops (10^15 floating point operations per second) and exascale systems are projected within seven years...

James Elliott, Kishor Kharbas, David Fiala, Frank ...

IJHPCA
2006

MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI

13 years 9 months ago

Download www.cs.utk.edu

High performance computing platforms like Clusters, Grid and Desktop Grids are becoming larger and subject to more frequent failures. MPI is one of the most used message...

Aurelien Bouteiller, Thomas Hérault, Gé...

CLOUDCOM
2010
Springer

REMEM: REmote MEMory as Checkpointing Storage

13 years 7 months ago

Download ft.ornl.gov

Checkpointing is a widely used mechanism for supporting fault tolerance, but it is notorious for its high-cost disk access. The idea of memory-based checkpointing has been extensively stu...

Hui Jin, Xian-He Sun, Yong Chen, Tao Ke

USENIX
2007

Transparent Checkpoint-Restart of Multiple Processes on Commodity Operating Systems

13 years 12 months ago

Download www.ncl.cs.columbia.edu

The ability to checkpoint a running application and restart it later can provide many useful benefits including fault recovery, advanced resources sharing, dynamic load balancing...

Oren Laadan, Jason Nieh
