The probability that a failure occurs before a computation completes grows with the number of processors used in a high-performance computing application. For l...
In this paper, we present a new fault tolerance system called DejaVu for transparent and automatic checkpointing, migration, and recovery of parallel and distributed applications....
Joseph F. Ruscio, Michael A. Heffner, Srinidhi Var...
Soft-state is a well-established approach to designing robust network protocols and applications. However, it is unclear how to apply the soft-state approach to protocols that must mai...
As technology scaling poses a threat to DRAM scaling due to physical limitations such as limited charge, alternative memory technologies including several emerging non-volatile me...
This paper presents a theoretical and experimental study on the limitations of copy-on-write snapshots and incremental backups in terms of data recoverability. We provide mathemat...