
SC
2009
ACM

FALCON: a system for reliable checkpoint recovery in shared grid environments

In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as the performance degradation is tolerable. For guest users, free resources come at the cost of unpredictable “failures”, where failures are defined as disruption in the guest job’s execution due to contention from the processes of the machine owner or the conventionally understood hardware and software failures. These unpredictable failures lead to unpredictable completion times. Checkpoint-recovery has long been used for providing reliability in failure-prone computing environments. Today’s production FGCS systems, such as Condor, use expensive, high-performance dedicated checkpoint servers, even though they could take advantage of free disk resources offered by the clusters’ commodity machines. Also, in large, geographically distributed clusters, dedicated checkpoint servers may incur high checkpoint transfer latencies. In this paper we consider...
Tanzima Zerin Islam, Saurabh Bagchi, Rudolf Eigenmann
Added 19 May 2010
Updated 19 May 2010
Type Conference
Year 2009
Where SC
Authors Tanzima Zerin Islam, Saurabh Bagchi, Rudolf Eigenmann