Large-scale production applications often have very long execution times and require periodic data checkpoints to save the state of the computation for program rest...
Wei-keng Liao, Avery Ching, Kenin Coloma, Alok N. ...
Abstract: High-performance computing platforms such as clusters, grids, and desktop grids are becoming larger and subject to more frequent failures. MPI is one of the most used message...
As the number of processors in today’s high-performance computers continues to grow, the mean time to failure of these computers is becoming significantly shorter than the exe...
Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julie...
Virtual private servers and application checkpoint and restart are two advanced operating system features which place different but related requirements on the way kernel-provided...
Sukadev Bhattiprolu, Eric W. Biederman, Serge E. H...
As the desire of scientists to perform ever larger computations drives the size of today’s high-performance computers from hundreds to thousands, and even tens of thousands of ...