Clusters and applications continue to grow in size while their mean time between failures (MTBF) is getting smaller. Checkpoint/Restart is becoming increasingly important for lar...
In this paper, a tool named CheCUDA is designed to checkpoint CUDA applications that use GPUs as accelerators. As existing checkpoint/restart implementations do not supp...
As cluster computers are used for a wider range of applications, we encounter the need to deliver resources at particular times, to meet particular deadlines, and/or at the same t...
Computing systems are becoming deeply embedded in ordinary life and interact with physical processes and events. They monitor the physical world with sensors and provide app...
Advance reservation is a mechanism to guarantee the availability of resources when they are needed. In the context of LambdaGrid, this mechanism is used to provide data-intensi...