Fault tolerance in MPI becomes a main issue in the HPC community. Several approaches are envisioned from user or programmer controlled fault tolerance to fully automatic fault ...
Aurelien Bouteiller, Boris Collin, Thomas Hé...
Management of a modern enterprise is based on the assumption that executive reports of lower-layer management are faithful to what is actually happening in the field. As some well...
We present in this paper an extension of the message-driven confidence-driven framework that we developed for onboard guarded software upgrading. The purpose of this work is to pr...
Many high-performance tools, applications and infrastructures, such as Paradyn, STAT, TAU, Ganglia, SuperMon, Astrolabe, Borealis, and MRNet, use data aggregation to synthesize lar...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This notion has also been extended to message-pas...
Rajanikanth Batchu, Yoginder S. Dandass, Anthony S...