Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing



As clusters grow in scale, long-running applications increasingly struggle to complete without encountering failures. Checkpointing/restart is widely used to provide basic fault tolerance, yet it suffers from high overhead and is purely reactive. In this work, we propose FT-Pro, an adaptive fault management mechanism that uses failure prediction to choose optimally among migration, checkpointing, or no action, with the goal of reducing application execution time in the presence of failures. A cost-based evaluation model drives this decision dynamically at run-time. Using the actual failure log from a production cluster at NCSA, we demonstrate that even with modest failure prediction accuracy, FT-Pro reduces application execution time by 13%-30% compared with the traditional checkpointing/restart strategy, a significant improvement for long-running applications.
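The run-time choice among migration, checkpointing, and no action can be illustrated with a small Python sketch that picks the action with the lowest expected overhead for the next interval. This is only a simplified illustration under stated assumptions: the abstract does not give FT-Pro's actual cost model, so the formulas, the `Costs` fields, and the function names `expected_overheads` and `choose_action` are all hypothetical. In particular, the sketch assumes that a migration always avoids the predicted failure and that a failure loses only the work done since the last checkpoint.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    """Hypothetical per-action costs, all in seconds."""
    checkpoint: float  # time to write a checkpoint
    migration: float   # time to move processes off a suspect node
    restart: float     # time to restart from the last checkpoint


def expected_overheads(p_fail: float, unsaved_work: float, costs: Costs) -> dict:
    """Expected overhead of each action for the next interval.

    p_fail is the predicted failure probability; unsaved_work is the
    computation (in seconds) done since the last checkpoint.
    These formulas are illustrative assumptions, not the paper's model.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be in [0, 1]")
    return {
        # No action: pay nothing now, but a failure costs a restart
        # plus redoing all unsaved work.
        "skip": p_fail * (costs.restart + unsaved_work),
        # Checkpoint: pay the checkpoint cost now; a failure then
        # costs only the restart.
        "checkpoint": costs.checkpoint + p_fail * costs.restart,
        # Migrate: pay the migration cost; assumed to avoid the failure.
        "migrate": costs.migration,
    }


def choose_action(p_fail: float, unsaved_work: float, costs: Costs) -> str:
    """Return the action with the smallest expected overhead."""
    overheads = expected_overheads(p_fail, unsaved_work, costs)
    return min(overheads, key=overheads.get)


if __name__ == "__main__":
    costs = Costs(checkpoint=60.0, migration=120.0, restart=300.0)
    for p, work in [(0.01, 600.0), (0.1, 3600.0), (0.9, 600.0)]:
        print(p, work, choose_action(p, work, costs))
```

With these made-up cost values, a low predicted failure probability favors no action, a moderate probability with a lot of unsaved work favors checkpointing, and a high probability favors migration. The crossover points depend entirely on the assumed costs.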

Yawei Li, Zhiling Lan


Application Execution Time | CCGRID 2006 | Cluster Computing | Failure Prediction | Long-running Applications

Post Info

Added	10 Jun 2010
Updated	10 Jun 2010
Type	Conference
Year	2006
Where	CCGRID
Authors	Yawei Li, Zhiling Lan
