Sciweavers

116 search results - page 4 / 24
» A Communication Framework for Fault-Tolerant Parallel Execut...
CLUSTER
2004
IEEE
13 years 11 months ago
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
Fault tolerance is a very important concern for critical high performance applications using the MPI library. Several protocols provide automatic and transparent fault detection a...
Pierre Lemarinier, Aurelien Bouteiller, Thomas H...
CCGRID
2006
IEEE
14 years 1 months ago
Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation
With the increasing number of processors in modern HPC (High Performance Computing) systems, there are two emergent problems to solve. One is scalability, the other is fault tolera...
Yuan Tang, Graham E. Fagg, Jack Dongarra
GCC
2004
Springer
14 years 25 days ago
State Management Issues and Grid Services
Defining the ways for components around the world to collaborate with each other to execute applications over the internet is one of the biggest challenges for computer scientists...
Yong Xie, Yong Meng Teo
PVM
2010
Springer
13 years 5 months ago
Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols
With the number of computing elements spiraling to hundreds of thousands in modern HPC systems, failures are common events. Few applications are nevertheless fault toleran...
George Bosilca, Aurelien Bouteiller, Thomas Hé...
FTDCS
1999
IEEE
13 years 11 months ago
Group Communication in Real-Time Computing Systems: Issues and Directions
Group communication in real-time computing systems has been a subject of research for almost two decades but it is not yet a mature technological field. The purpose of this paper...
K. H. Kim