The increasing numbers of cores, shared caches and memory nodes within machines introduces a complex hardware topology. High-performance computing applications now have to carefull...
In large-scale clusters and computational grids, component failures become norms instead of exceptions. Failure occurrence as well as its impact on system performance and operatio...
Abstract—In this paper, a tool named CheCUDA is designed to checkpoint CUDA applications that use GPUs as accelerators. As existing checkpoint/restart implementations do not supp...