Fault tolerance is an important issue for large machines with tens or hundreds of thousands of processors. Checkpoint-based methods, currently used on most machines, roll back all ...
To be able to fully exploit ever larger computing platforms, modern HPC applications and system software must be able to tolerate inevitable faults. Historically, MPI implementati...
Joshua Hursey, Jeffrey M. Squyres, Timothy Mattox,...
Thread migration/checkpointing is becoming indispensable for load balancing and fault tolerance in high-performance computing applications, and its success depends on the migration...
Checkpointing and replaying is an attractive technique that has been used widely at the operating/runtime system level to provide fault tolerance. Applying such a technique at the...
Real-time applications such as military aircraft flight control systems and online banking are critical with respect to security and reliability. In this paper we present a way ...
Kiranmai Bellam, Raghava K. Vudata, Xiao Qin, Zili...