DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart are demonstrated for a wid...
Future scalable, high-throughput, and high-performance applications are likely to execute on platforms constructed by clustering multiple autonomous distributed servers, with reso...
Many parallel applications from scientific computing use MPI collective communication operations to collect or distribute data. Since the execution times of these communication op...
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a commo...
High-performance computing (HPC) systems consume a significant amount of power, resulting in high operational costs, reduced reliability, and waste of natural resources. Therefor...
Reza Zamani, Ahmad Afsahi, Ying Qian, V. Carl Hama...