Sciweavers

668 search results - page 4 / 134
» Implementing and Evaluating Automatic Checkpointing
IJHPCA
2006
13 years 6 months ago
MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI
High-performance computing platforms such as clusters, grids, and desktop grids are becoming larger and subject to more frequent failures. MPI is one of the most used message...
Aurelien Bouteiller, Thomas Hérault, Gé...
DELTA
2006
IEEE
14 years 22 days ago
Synthesis of Fault-Tolerant Embedded Systems with Checkpointing and Replication
We present an approach to the synthesis of fault-tolerant hard real-time systems for safety-critical applications. We use checkpointing with rollback recovery and active replicati...
Viacheslav Izosimov, Paul Pop, Petru Eles, Zebo Pe...
FGCS
2008
13 years 6 months ago
Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols
A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant progr...
Darius Buntinas, Camille Coti, Thomas Hérau...
JFP
2010
13 years 5 months ago
Lightweight checkpointing for concurrent ML
Transient faults that arise in large-scale software systems can often be repaired by re-executing the code in which they occur. Ascribing a meaningful semantics for safe re-execut...
Lukasz Ziarek, Suresh Jagannathan
SBACPAD
2005
IEEE
14 years 8 days ago
Portable checkpointing and communication for BSP applications on dynamic heterogeneous Grid environments
Executing long-running parallel applications in Opportunistic Grid environments composed of heterogeneous, shared user workstations is a daunting task. Machines may fail, become ...
Raphael Y. de Camargo, Fabio Kon, Alfredo Goldman