Sciweavers

IEEEHPCS
2010

Using replication and checkpointing for reliable task management in computational Grids

In grid computing systems, fault tolerance is required for both scientific computation and file sharing to increase their reliability. Previous work has proposed several fault-tolerance mechanisms for grid and distributed computing systems. However, some of them use only space redundancy (hardware replication), while others use only time redundancy (checkpointing and rollback). As a result, the existing mechanisms use grid resources inefficiently. In this paper, we present ART, an Adaptive, Reliable, and fault-Tolerant task management scheme for grid computing environments. The main goal of ART is to reduce the number of replicas by applying a checkpointing and rollback scheme to each replica. In ART, the minimum number of replicas is selected adaptively, based on an analysis of the probability of successful execution within the given deadline and the reliability requirement of each task. Our simulation results show that ART can significantly redu...
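The adaptive selection step described in the abstract can be illustrated with a minimal sketch. It is not the model from the paper. It assumes that replicas fail independently and that each replica finishes within the deadline with the same probability `p_success`; the function name `min_replicas` and the `max_replicas` cap are hypothetical. Under those assumptions, the smallest replica count n is the first one for which 1 - (1 - p_success)^n meets the task's reliability requirement.

```python
def min_replicas(p_success: float, reliability: float, max_replicas: int = 64) -> int:
    """Return the smallest n such that 1 - (1 - p_success)**n >= reliability.

    Assumption (not from the paper): replicas succeed independently, each
    with probability p_success of finishing within the deadline.
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")

    p_fail = 1.0 - p_success
    for n in range(1, max_replicas + 1):
        # Probability that at least one of n independent replicas succeeds.
        if 1.0 - p_fail ** n >= reliability:
            return n
    raise ValueError("reliability target not reachable within max_replicas")


if __name__ == "__main__":
    # Hypothetical numbers: checkpointing is modeled here only as a higher
    # per-replica success probability within the deadline.
    print(min_replicas(0.5, 0.99))  # without checkpointing (assumed p = 0.5)
    print(min_replicas(0.8, 0.99))  # with checkpointing (assumed p = 0.8)
```

In this toy setting, raising the per-replica success probability from 0.5 to 0.8 lowers the required replica count from 7 to 3, which mirrors the abstract's claim that combining checkpointing with replication reduces the number of replicas needed.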
Added 13 Feb 2011
Updated 13 Feb 2011
Type Journal
Year 2010
Where IEEEHPCS
Authors Sangho Yi, Derrick Kondo, Bongjae Kim, Geunyoung Park, Yookun Cho