This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be efficiently extended to tolerate single-node failures. In particular, we extend a ...
As the scale of high-performance computing (HPC) continues to grow, failure resilience of parallel applications becomes crucial. In this paper, we present FT-Pro, an adaptive fault...
This paper describes a modeling framework for evaluating the impact of faults on the output of streaming applications. Our model is based on three abstractions: stream operators, strea...
Gabriela Jacques-Silva, Zbigniew Kalbarczyk, Bugra...
In this paper, we present a checkpoint-based scheme to improve the turnaround time of bag-of-tasks applications executed on institutional desktop grids. We propose to share checkp...
In this paper, we present a tool to extract I/O traces from very large applications running at full scale during their production runs. We analyze these traces to gain informatio...
Nithin Nakka, Alok N. Choudhary, Wei-keng Liao, Le...