Using replication and checkpointing for reliable task management in computational Grids



In grid computing systems, fault tolerance is required for both scientific computation and file sharing to increase reliability. Previous work has proposed several mechanisms for grid and distributed computing systems. However, some of them use only space redundancy (hardware replication), while others use only time redundancy (checkpointing and rollback). As a result, the existing mechanisms use grid resources inefficiently. In this paper, we present ART, an Adaptive, Reliable, and fault-Tolerant task management scheme for grid computing environments. The main goal of ART is to reduce the number of replications by applying a checkpointing and rollback scheme to each replication. In ART, the minimum number of replications is selected adaptively, based on an analysis of the probability of successful execution within the given deadline and the reliability requirement of each task. Our simulation results show that ART can significantly redu...
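The replica-selection idea in the abstract can be illustrated with a small sketch. It is not ART's actual model, which the abstract does not spell out. It assumes that replicas fail independently and that each replica finishes within the deadline with the same probability; the function name `min_replicas`, its parameters, and the numbers in the usage note are hypothetical.

```python
from typing import Optional


def min_replicas(p_success: float, reliability: float,
                 max_replicas: int = 64) -> Optional[int]:
    """Smallest n such that 1 - (1 - p_success)**n >= reliability.

    Assumption (not taken from the paper): replicas are independent and
    each one completes before the deadline with probability p_success.
    Returns None if no n up to max_replicas meets the requirement.
    """
    if not (0.0 <= p_success <= 1.0 and 0.0 <= reliability <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    for n in range(1, max_replicas + 1):
        # Probability that at least one of the n replicas succeeds.
        if 1.0 - (1.0 - p_success) ** n >= reliability:
            return n
    return None


if __name__ == "__main__":
    # Illustrative values only: checkpointing is modeled here simply as a
    # higher per-replica success probability within the deadline.
    print(min_replicas(0.5, 0.99))  # without checkpointing (assumed p = 0.5)
    print(min_replicas(0.8, 0.99))  # with checkpointing (assumed p = 0.8)
```

Under these assumptions, raising the per-replica success probability from 0.5 to 0.8 lowers the required replica count from 7 to 3 for a 0.99 reliability target. This is the kind of saving the abstract attributes to combining checkpointing with replication.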

Sangho Yi, Derrick Kondo, Bongjae Kim, Geunyoung Park, Yookun Cho


Applied Computing | Computing Systems | Grid | Grid Computing | IEEEHPCS 2010 |


» Checkpoint and Restart for Distributed Components in XCAT3

» Graph-Based Task Replication for Workflow Applications

» Dynasa: adapting grid applications to safety using fault-tolerant methods

» Towards Autonomic Grid Data Management with Virtualized Distributed File Systems

» Probabilistic allocation of tasks on desktop grids

» Supporting application-tailored grid file system sessions with WSRF-based services

» Intelligent Selection of Fault Tolerance Techniques on the Grid

» Multi-Replication with Intelligent Staging in Data-Intensive Grid Applications

Post Info

Added	13 Feb 2011
Updated	13 Feb 2011
Type	Journal
Year	2010
Where	IEEEHPCS
Authors	Sangho Yi, Derrick Kondo, Bongjae Kim, Geunyoung Park, Yookun Cho
