Sciweavers

2226 search results - page 27 / 446
» Fault-Tolerant Parallel Applications with Dynamic Parallel S...
ICPP
2007
IEEE
14 years 2 months ago
Fault-Driven Re-Scheduling For Improving System-level Fault Resilience
The productivity of HPC systems is determined not only by their performance, but also by their reliability. The conventional method to limit the impact of failures is checkpointing...
Yawei Li, Prashasta Gujrati, Zhiling Lan, Xian-He ...
ICPPW
2009
IEEE
13 years 5 months ago
Analyzing Checkpointing Trends for Applications on the IBM Blue Gene/P System
Current petascale systems have tens of thousands of hardware components and complex system software stacks, which increase the probability of faults occurring during the lifetime ...
Harish Gapanati Naik, Rinku Gupta, Pete Beckman
CCGRID
2006
IEEE
14 years 1 month ago
Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation
With the increasing number of processors in modern HPC (High Performance Computing) systems, there are two emergent problems to solve. One is scalability, the other is fault tolera...
Yuan Tang, Graham E. Fagg, Jack Dongarra
IPPS
2008
IEEE
14 years 2 months ago
Large-scale experiment of co-allocation strategies for Peer-to-Peer supercomputing in P2P-MPI
High performance computing generally involves parallel applications that are deployed on the multiple resources used for the computation. The problem of scheduling the applicat...
Stéphane Genaud, Choopan Rattanapoka