Sciweavers

» Self Adaptive Application Level Fault Tolerance for Parallel...
PPOPP
2005
ACM
14 years 2 months ago
Fault tolerant high performance computing by a coding approach
As the number of processors in today’s high performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the exe...
Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julie...
CCGRID
2010
IEEE
13 years 9 months ago
Selective Recovery from Failures in a Task Parallel Programming Model
We present a fault tolerant task pool execution environment that is capable of performing fine-grain selective restart using a lightweight, distributed task completion tr...
James Dinan, Arjun Singri, P. Sadayappan, Sriram K...
COLCOM
2005
IEEE
14 years 2 months ago
On-demand overlay networking of collaborative applications
We propose a new overlay network, called Generic Identifier Network (GIN), for collaborative nodes to share objects with transactions across affiliated organizations by merging th...
Cheng-Jia Lai, Richard R. Muntz
IPPS
1998
IEEE
14 years 25 days ago
Self-Testing Fault-Tolerant Real-Time Systems
We propose a periodic diagnostic algorithm based on the testing model of computation for real-time systems. The diagnostic task runs on every processor of the system. When the task...
M. Rooholamini, Seyed H. Hosseini
CCGRID
2006
IEEE
14 years 2 months ago
Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation
With the increasing number of processors in modern HPC (High Performance Computing) systems, there are two emergent problems to solve. One is scalability, the other is fault tolera...
Yuan Tang, Graham E. Fagg, Jack Dongarra