Sciweavers

354 search results - page 4 / 71
» Self Adaptive Application Level Fault Tolerance for Parallel...
ISPA
2004
Springer
14 years 1 month ago
Highly Reliable Linux HPC Clusters: Self-Awareness Approach
Current solutions for fault tolerance in HPC systems focus on dealing with the result of a failure. However, most are unable to handle runtime system configuration change...
Chokchai Leangsuksun, Tong Liu, Yudan Liu, Stephen...
IPPS
2007
IEEE
14 years 2 months ago
DejaVu: Transparent User-Level Checkpointing, Migration, and Recovery for Distributed Systems
In this paper, we present a new fault tolerance system called DejaVu for transparent and automatic checkpointing, migration, and recovery of parallel and distributed applications....
Joseph F. Ruscio, Michael A. Heffner, Srinidhi Var...
CORR
2008
Springer
13 years 8 months ago
Algorithmic Based Fault Tolerance Applied to High Performance Computing
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance techniq...
George Bosilca, Remi Delmas, Jack Dongarra, Julien...
IPPS
2003
IEEE
14 years 1 month ago
A Low Cost Fault Tolerant Packet Routing for Parallel Computers
This work presents a new switching mechanism to tolerate arbitrary faults in interconnection networks with a negligible implementation cost. Although our routing technique can be ...
Valentin Puente, José A. Gregorio, Ramó...