Sciweavers

22 search results - page 3 / 5
» Fault Tolerance in Message Passing and in Action
CONCURRENCY
2010
13 years 8 months ago
Redesigning the message logging model for high performance
Over the past decade, the number of processors in high-performance facilities has grown to hundreds of thousands. As a direct consequence, while the computational power follows th...
Aurelien Bouteiller, George Bosilca, Jack Dongarra
JCP
2006
13 years 8 months ago
A Local Enumeration Protocol in Spite of Corrupted Data
We present a novel self-stabilizing version of Mazurkiewicz's enumeration algorithm [1]. The initial version is based on local rules to enumerate nodes on an anonymous network. [2] p...
Brahim Hamid, Mohamed Mosbah
ICDCS
2000
IEEE
14 years 28 days ago
On Low-Cost Error Containment and Recovery Methods for Guarded Software Upgrading
To assure dependable onboard evolution, we have developed a methodology called guarded software upgrading (GSU). In this paper, we focus on a low-cost approach to error containmen...
Ann T. Tai, Kam S. Tso, Leon Alkalai, Savio N. Cha...
HPDC
2009
IEEE
14 years 3 months ago
Interconnect agnostic checkpoint/restart in Open MPI
Long-running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Message Passi...
Joshua Hursey, Timothy Mattox, Andrew Lumsdaine
SPAA
2010
ACM
14 years 1 month ago
Brief announcement: Byzantine agreement with homonyms
In this work, we address Byzantine agreement in a message passing system with homonyms, i.e. a system with a number l of authenticated identities that is independent of the total ...
Carole Delporte-Gallet, Hugues Fauconnier, Rachid ...