Sciweavers

ICPP
2007
IEEE

Fault-Driven Re-Scheduling For Improving System-level Fault Resilience

The productivity of HPC systems is determined not only by their performance but also by their reliability. The conventional method to limit the impact of failures is checkpointing. However, existing research shows that such a reactive fault tolerance approach can improve system productivity only marginally. Leveraging recent progress in failure prediction, we propose fault-driven rescheduling (FARS) to improve system resilience to failures, and investigate the feasibility and effectiveness of using failure predictions to dynamically adjust the placement of active jobs (e.g. running jobs). In particular, a rescheduling algorithm is designed to enable effective job adjustment by evaluating the performance impact of both potential failures and rescheduling on user jobs. The proposed FARS complements existing research on fault-aware scheduling by allowing user jobs to avoid imminent failures at runtime. We evaluate FARS by using actual wor...
Yawei Li, Prashasta Gujrati, Zhiling Lan, Xian-He Sun
Added 03 Jun 2010
Updated 03 Jun 2010
Type Conference
Year 2007
Where ICPP
Authors Yawei Li, Prashasta Gujrati, Zhiling Lan, Xian-He Sun