Sciweavers

» Model-based fault localization in large-scale computing syst...
IPPS
2005
IEEE
14 years 1 month ago
Fault-Tolerant Parallel Applications with Dynamic Parallel Schedules
Commodity computer clusters are often composed of hundreds of computing nodes. These generally off-the-shelf systems are not designed for high reliability. Node failures therefore...
Sebastian Gerlach, Roger D. Hersch
GRID
2006
Springer
13 years 8 months ago
The Palantir Grid Meta-Information System
Grids allow large scale resource-sharing across different administrative domains. Those diverse resources are likely to join or quit the Grid at any moment or possibly to break dow...
Francesc Guim, Ivan Rodero, M. Tomas, Julita Corba...
HPDC
2008
IEEE
14 years 2 months ago
DataLab: transactional data-parallel computing on an active storage cloud
Active storage clouds are an attractive platform for executing large data intensive workloads found in many fields of science. However, active storage presents new system managem...
Brandon Rich, Douglas Thain
CCGRID
2006
IEEE
14 years 1 month ago
Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation
With the increasing number of processors in modern HPC (High Performance Computing) systems, there are two emergent problems to solve. One is scalability, the other is fault tolera...
Yuan Tang, Graham E. Fagg, Jack Dongarra
ATAL
2003
Springer
14 years 1 month ago
A protocol for multi-agent diagnosis with spatially distributed knowledge
In a large distributed system it is often infeasible or even impossible to perform diagnosis using a single model of the whole system. Instead, several spatially distributed local...
Nico Roos, Annette ten Teije, Cees Witteveen