Sciweavers

464 search results - page 19 / 93
» A Fault Tolerance Protocol with Fast Fault Recovery
FGCS
2008
13 years 7 months ago
Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols
A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant progr...
Darius Buntinas, Camille Coti, Thomas Hérau...
EDCC
2006
Springer
13 years 11 months ago
SEU Mitigation Techniques for Microprocessor Control Logic
Fault tolerance at the processor architecture level has become increasingly important due to rapid advancements in the design and usage of high performance de...
T. S. Ganesh, Viswanathan Subramanian, Arun K. Som...
CLUSTER
2003
IEEE
14 years 27 days ago
Coordinated Checkpoint versus Message Log for Fault Tolerant MPI
Large clusters, high-availability clusters and Grid deployments often suffer from network, node or operating system faults and thus require the use of fault tolerant programmin...
Aurelien Bouteiller, Pierre Lemarinier, Gér...
TDSC
2011
13 years 2 months ago
Application-Level Diagnostic and Membership Protocols for Generic Time-Triggered Systems
We present on-line tunable diagnostic and membership protocols for generic time-triggered (TT) systems to detect crashes, send/receive omission faults and network parti...
Marco Serafini, Péter Bokor, Neeraj Suri, J...
PVM
2007
Springer
14 years 1 months ago
Using CMT in SCTP-Based MPI to Exploit Multiple Interfaces in Cluster Nodes
Many existing clusters use inexpensive Gigabit Ethernet and often have multiple interface cards to improve bandwidth and enhance fault tolerance. We investigate the use of Concurr...
Brad Penoff, Mike Tsai, Janardhan R. Iyengar, Alan...