Sciweavers

IPPS 2006, IEEE
Coordinated checkpoint from message payload in pessimistic sender-based message logging
Execution of MPI applications on Clusters and Grid deployments suffers from node and network failure that motivates the use of fault tolerant MPI implementations. Two category tec...
M. Aminian, Mohammad K. Akbari, Bahman Javadi
HPDC 2009, IEEE
Interconnect agnostic checkpoint/restart in Open MPI
Long running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Message Passi...
Joshua Hursey, Timothy Mattox, Andrew Lumsdaine
SRDS 1999, IEEE
Enforcing Determinism for the Consistent Replication of Multithreaded CORBA Applications
In CORBA-based applications that depend on object replication for fault tolerance, inconsistencies in the states of the replicas of an object can arise when concurrent threads wit...
Priya Narasimhan, Louise E. Moser, P. M. Melliar-S...
SRDS 2007, IEEE
The Fail-Heterogeneous Architectural Model
Fault tolerant distributed protocols typically utilize a homogeneous fault model, either fail-crash or fail-Byzantine, where all processors are assumed to fail in the same manner....
Marco Serafini, Neeraj Suri
JAVA 2001, Springer
A scalable, robust network for parallel computing
CX, a network-based computational exchange, is presented. The system’s design integrates variations of ideas from other researchers, such as work stealing, non-blocking tasks, e...
Peter R. Cappello, Dimitros Mourloukos