Sciweavers

» A large-scale study of failures in high-performance computin...
PDP
2002
IEEE
15 years 8 months ago
Eventually Consistent Failure Detectors
The concept of unreliable failure detector was introduced by Chandra and Toueg as a mechanism that provides information about process failures. This mechanism has been used to sol...
Mikel Larrea, Antonio Fernández, Sergio Ar...
HPDC
2006
IEEE
15 years 9 months ago
Materializing Highly Available Grids
Grids are becoming a mission-critical component in research and industry. The services they provide are thus required to be highly available, contributing to the vision of the Gri...
Mark Silberstein, Gabriel Kliot, Artyom Sharov, As...
ICDCS
2011
IEEE
14 years 3 months ago
Provisioning a Multi-tiered Data Staging Area for Extreme-Scale Machines
Massively parallel scientific applications, running on extreme-scale supercomputers, produce hundreds of terabytes of data per run, driving the need for storage solutions to im...
Ramya Prabhakar, Sudharshan S. Vazhkudai, Youngjae...
DSN
2002
IEEE
15 years 8 months ago
Robust Software - No More Excuses
Software developers identify two main reasons why software systems are not made robust: performance and practicality. This work demonstrates the effectiveness of general technique...
John DeVale, Philip Koopman
PADS
2006
ACM
15 years 9 months ago
A Framework for Robust HLA-based Distributed Simulations
The High Level Architecture (HLA) is a standard for the interoperability and reuse of simulation components, referred to as federates. Large scale HLA-compliant simulations are bu...
Dan Chen, Stephen John Turner, Wentong Cai