
CCGRID
2008
IEEE


Fault Tolerance and Recovery of Scientific Workflows on Computational Grids



In this paper, we describe the design and implementation of two mechanisms for fault tolerance and recovery of complex scientific workflows on computational grids. We present our algorithms for over-provisioning and migration, which are our primary fault-tolerance strategies. To determine the correct fault-tolerance strategy, we consider application performance models, resource reliability models, network latency and bandwidth, and queue wait times for batch queues on compute resources. Our goal is to balance reliability and performance in the presence of soft real-time constraints such as deadlines and expected success probabilities, and to do so in a way that is transparent to scientists. We have evaluated our strategies by developing a Fault-Tolerance and Recovery (FTR) service and deploying it as part of the Linked Environments for Atmospheric Discovery (LEAD) production infrastructure. Results from real usage scenarios in LEAD show that the failure rate of individual steps in...
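The abstract does not give the over-provisioning algorithm itself, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that replicas of a workflow step fail independently and that each resource has a single known success probability; the function name `replicas_needed` and the `max_replicas` cap are hypothetical. Under those assumptions, running n replicas succeeds with probability 1 - (1 - p)^n, and the sketch finds the smallest n that meets an expected success probability.

```python
from typing import Optional


def replicas_needed(p_success: float, target: float,
                    max_replicas: int = 16) -> Optional[int]:
    """Smallest replica count whose combined success probability meets target.

    Assumption (not from the paper): replicas fail independently, so the
    chance that all n replicas fail is (1 - p_success) ** n.
    Returns None if the target cannot be met within max_replicas.
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be in [0, 1]")
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be in [0, 1]")

    all_fail = 1.0  # probability that every replica so far has failed
    for n in range(1, max_replicas + 1):
        all_fail *= (1.0 - p_success)
        if 1.0 - all_fail >= target:
            return n
    return None


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    print(replicas_needed(0.8, 0.99))
    print(replicas_needed(0.1, 0.999, max_replicas=4))
```

A real decision between over-provisioning and migration would also have to weigh the performance models, network costs, and queue wait times that the abstract lists; this sketch covers only the reliability side.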

Gopi Kandaswamy, Anirban Mandal, Daniel A. Reed


CCGRID 2008 | Complex Scientific Workflows | Correct Fault-tolerance Strategy | Distributed And Parallel Computing | Fault-tolerance


Related Content

» Fault-Tolerant BPEL Workflow Execution via Cloud-Aware Recovery Policies

» Theoretical enzyme design using the Kepler scientific workflows on the Grid

» Performability modeling for scheduling and fault tolerance strategies for scientific workf...

» An Autonomic Workflow Management System for Global Grids

» Kadre: domain-specific architectural recovery for scientific software systems

» Implementation of Fault-Tolerant GridRPC Applications

» VGrADS: enabling e-Science workflows on grids and clouds with fault tolerance

» Workflow-Oriented Collaborative Grid Portals

» Kepler + Hadoop: a general architecture facilitating data-intensive applications in scientifi...

Post Info

Added	07 Dec 2010
Updated	07 Dec 2010
Type	Conference
Year	2008
Where	CCGRID
Authors	Gopi Kandaswamy, Anirban Mandal, Daniel A. Reed
