Sciweavers

354 search results - page 7 / 71
» Self Adaptive Application Level Fault Tolerance for Parallel...
PPOPP
2005
ACM
15 years 8 months ago
Fault tolerant high performance computing by a coding approach
As the number of processors in today’s high performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the exe...
Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julie...
CCGRID
2010
IEEE
15 years 3 months ago
Selective Recovery from Failures in a Task Parallel Programming Model
We present a fault tolerant task pool execution environment that is capable of performing fine-grain selective restart using a lightweight, distributed task completion tr...
James Dinan, Arjun Singri, P. Sadayappan, Sriram K...
COLCOM
2005
IEEE
15 years 8 months ago
On-demand overlay networking of collaborative applications
We propose a new overlay network, called Generic Identifier Network (GIN), for collaborative nodes to share objects with transactions across affiliated organizations by merging th...
Cheng-Jia Lai, Richard R. Muntz
IPPS
1998
IEEE
15 years 7 months ago
Self-Testing Fault-Tolerant Real-Time Systems
We propose a periodic diagnostic algorithm based on the testing model of computation for real-time systems. The diagnostic task runs on every processor of the system. When the task...
M. Rooholamini, Seyed H. Hosseini
CCGRID
2006
IEEE
15 years 8 months ago
Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation
With the increasing number of processors in modern HPC (High Performance Computing) systems, two emerging problems need to be solved. One is scalability, the other is fault tolera...
Yuan Tang, Graham E. Fagg, Jack Dongarra