Fault-Driven Re-Scheduling For Improving System-level Fault Resilience

The productivity of HPC systems is determined not only by their performance but also by their reliability. The conventional method of limiting the impact of failures is checkpointing. However, existing research shows that such a reactive fault-tolerance approach improves system productivity only marginally. Leveraging recent progress in failure prediction, we propose fault-driven rescheduling (FARS) to improve system resilience to failures, and we investigate the feasibility and effectiveness of using failure prediction to dynamically adjust the placement of active jobs (e.g., running jobs). In particular, a rescheduling algorithm is designed to enable effective job adjustment by evaluating the performance impact of both potential failures and rescheduling on user jobs. The proposed FARS complements existing research on fault-aware scheduling by allowing user jobs to avoid imminent failures at runtime. We evaluate FARS by using actual wor...
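The abstract does not give the rescheduling algorithm itself, so the following is only a minimal sketch of the general idea it describes: weigh the expected loss from a predicted failure against the overhead of moving a job. The `Job` fields, the `plan_rescheduling` function, the cost model (failure probability times work at risk, minus migration cost), and the greedy assignment of spare nodes are all assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # All fields are hypothetical; the abstract does not define a job model.
    name: str
    node: str                # node the job currently runs on
    work_at_risk_s: float    # seconds of work lost if the node fails now
    migration_cost_s: float  # seconds of overhead to move the job elsewhere


def plan_rescheduling(jobs, failure_prob, spare_nodes):
    """Return {job name: target node} for jobs that appear worth moving.

    failure_prob maps a node name to its predicted failure probability
    (assumed to come from some external failure predictor).
    """
    spares = list(spare_nodes)
    candidates = []
    for job in jobs:
        p = failure_prob.get(job.node, 0.0)
        expected_loss = p * job.work_at_risk_s
        gain = expected_loss - job.migration_cost_s
        # Move a job only when staying is expected to cost more than moving.
        if gain > 0:
            candidates.append((gain, job))

    # Greedy choice (an assumption): serve the largest expected gain first.
    candidates.sort(key=lambda c: c[0], reverse=True)

    plan = {}
    for _gain, job in candidates:
        if not spares:
            break
        plan[job.name] = spares.pop(0)
    return plan
```

For example, with a predicted failure probability of 0.8 on node `n1`, a job with 1000 seconds of work at risk and a 100-second migration cost would be moved, while a job with only 100 seconds at risk would stay, because its expected loss (80 seconds) is below the cost of moving it.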

Yawei Li, Prashasta Gujrati, Zhiling Lan, Xian-He Sun

Distributed And Parallel Computing | Failure Prediction | Fault-driven Rescheduling | ICPP 2007 | User Jobs |

Post Info

Added	03 Jun 2010
Updated	03 Jun 2010
Type	Conference
Year	2007
Where	ICPP
Authors	Yawei Li, Prashasta Gujrati, Zhiling Lan, Xian-He Sun
