Sciweavers

» Fault Tolerant Wide-Area Parallel Computing
MIDDLEWARE
2009
Springer
14 years 2 months ago
Why Do Upgrades Fail and What Can We Do about It?
Abstract. Enterprise-system upgrades are unreliable and often produce downtime or data loss. Errors in the upgrade procedure, such as broken dependencies, constitute the leading ca...
Tudor Dumitras, Priya Narasimhan
HCW
2000
IEEE
14 years 2 days ago
Evaluation of PAMS' Adaptive Management Services
Management of large-scale parallel and distributed applications is an extremely complex task due to factors such as centralized management architectures, lack of coordination and ...
Yoonhee Kim, Salim Hariri, Muhamad Djunaedi
HPDC
2009
IEEE
14 years 2 months ago
Interconnect agnostic checkpoint/restart in Open MPI
Long-running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Message Passi...
Joshua Hursey, Timothy Mattox, Andrew Lumsdaine
IPPS
2007
IEEE
14 years 1 month ago
Achieving Reliable Parallel Performance in a VoD Storage Server Using Randomization and Replication
This paper investigates randomization and replication as strategies to achieve reliable performance in disk arrays targeted for video-on-demand (VoD) workloads. A disk array can p...
Yung Ryn Choe, Vijay S. Pai
HPDC
2007
IEEE
14 years 2 months ago
Failure-aware checkpointing in fine-grained cycle sharing systems
Fine-Grained Cycle Sharing (FGCS) systems aim at utilizing the large amount of idle computational resources available on the Internet. Such systems allow guest jobs to run on a ho...
Xiaojuan Ren, Rudolf Eigenmann, Saurabh Bagchi