Sciweavers

48 search results - page 8 / 10
» Self-stabilizing algorithm for checkpointing in a distribute...
TROB
2002
13 years 7 months ago
Distributed surveillance and reconnaissance using multiple autonomous ATVs: CyberScout
The objective of the CyberScout project is to develop an autonomous surveillance and reconnaissance system using a network of all-terrain vehicles. In this paper, we focus on two f...
Mahesh Saptharishi, C. Spence Oliver, Christopher ...
PVM
2005
Springer
14 years 1 month ago
Scalable Fault Tolerant MPI: Extending the Recovery Algorithm
Fault Tolerant MPI (FT-MPI) [6] was designed as a solution to allow applications different methods to handle process failures beyond simple check-point restart schemes. The init...
Graham E. Fagg, Thara Angskun, George Bosilca, Jel...
ICDCS
1996
IEEE
13 years 12 months ago
How to Recover Efficiently and Asynchronously when Optimism Fails
We propose a new algorithm for recovering asynchronously from failures in a distributed computation. Our algorithm is based on two novel concepts - a fault-tolerant vector clock t...
Om P. Damani, Vijay K. Garg
EGC
2005
Springer
14 years 1 month ago
Transparent Fault Tolerance for Grid Applications
A major challenge facing grid applications is the appropriate handling of failures. In this paper we address the problem of making parallel Java applications based on Remote Method...
Pawel Garbacki, Bartosz Biskupski, Henri E. Bal
ICDCS
2011
IEEE
12 years 7 months ago
Provisioning a Multi-tiered Data Staging Area for Extreme-Scale Machines
Massively parallel scientific applications, running on extreme-scale supercomputers, produce hundreds of terabytes of data per run, driving the need for storage solutions to im...
Ramya Prabhakar, Sudharshan S. Vazhkudai, Youngjae...