
TC
2008


Adaptive Fault Management of Parallel Applications for High-Performance Computing



Download www.cs.iit.edu

As the scale of high-performance computing (HPC) continues to grow, failure resilience of parallel applications becomes crucial. In this paper, we present FT-Pro, an adaptive fault management approach that combines proactive migration with reactive checkpointing. It aims to enable parallel applications to avoid anticipated failures via preventive migration and, in the case of unforeseeable failures, to minimize their impact through selective checkpointing. An adaptation manager is designed to make runtime decisions in response to failure prediction. Extensive experiments, by means of stochastic modeling and case studies with real applications, indicate that FT-Pro outperforms periodic checkpointing, in terms of reducing application completion times and improving resource utilization, by up to 43 percent.
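The abstract does not spell out the adaptation manager's decision rule, so the following is only a minimal sketch of how a runtime choice between doing nothing, checkpointing, and migrating might look under a simple expected-cost model. The function name `choose_action`, all parameters, the cost formulas, and the 0.01 residual failure probability are assumptions made for illustration; none of them is taken from FT-Pro itself.

```python
from enum import Enum


class Action(Enum):
    SKIP = "skip"
    CHECKPOINT = "checkpoint"
    MIGRATE = "migrate"


def choose_action(failure_predicted, spare_nodes, precision,
                  work_at_risk, checkpoint_cost, migration_cost,
                  recovery_cost):
    """Pick the action with the lowest expected cost (all costs in seconds).

    Hypothetical model, not the paper's:
    precision     -- assumed probability that a predicted failure occurs
    work_at_risk  -- computation since the last checkpoint, lost on failure
    """
    # Assumption: with no warning, a small residual failure chance remains.
    p_fail = precision if failure_predicted else 0.01

    expected = {
        # Do nothing: lose unsaved work and pay recovery if the failure hits.
        Action.SKIP: p_fail * (work_at_risk + recovery_cost),
        # Checkpoint now: unsaved work is safe, recovery is still paid.
        Action.CHECKPOINT: checkpoint_cost + p_fail * recovery_cost,
    }
    # Migration sidesteps the failure but needs a warning and a spare node.
    if failure_predicted and spare_nodes > 0:
        expected[Action.MIGRATE] = migration_cost

    return min(expected, key=expected.get)
```

In this toy model, a predicted failure with a spare node available favors migration when migration is cheap, a predicted failure without spares falls back to checkpointing, and the absence of a warning usually favors skipping the checkpoint.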

Zhiling Lan, Yawei Li


Adaptive Fault Management | Information Technology | Outperforms Periodic Checkpointing | Parallel Applications | TC 2008


Related Content

» Fault tolerant high performance computing by a coding approach

» Self Adaptive Application Level Fault Tolerance for Parallel and Distributed Computing

» Algorithmic Based Fault Tolerance Applied to High Performance Computing

» System-Level Virtualization for High Performance Computing

» Enhancing application robustness through adaptive fault tolerance

» Parallel Processing on Networks of Workstations: A Fault-Tolerant, High Performance Approach

» Software-Implemented Fault Detection for High-Performance Space Applications

» QsNetIII: An Adaptively Routed Network for High Performance Computing

» Fault Tolerant Wide-Area Parallel Computing

Post Info

Added	15 Dec 2010
Updated	15 Dec 2010
Type	Journal
Year	2008
Where	TC
Authors	Zhiling Lan, Yawei Li
