Long running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Message Passi...
Abstract. Service composition, which enables users to construct complex services from atomic services, is an essential feature for the usability of Mobile Ad hoc Networks (MANETs)....
Zhen-guo Gao, Sheng Liu, Ming Ji, Jinhua Zhao, Lih...
Record and Replay (RR) is a software based state replication solution designed to support recording and subsequent replay of the execution of unmodified applications running on mu...
Philippe Bergheaud, Dinesh Subhraveti, Marc Vertes
We present in this paper the recent developments done in P2P-MPI, a grid middleware, concerning the fault management, which covers fault-tolerance for applications and fault detect...
This paper addresses the issue of fault-tolerance in applications that make use of network storage. A network abstraction called the Network Storage Stack is presented, along with...
Scott Atchley, Stephen Soltesz, James S. Plank, Mi...