BlueGene/L Failure Analysis and Prediction Models

The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM’s BlueGene/L, which can accommodate as many as 128K processors. One of the challenges in designing and deploying these systems in a production setting is the need to take failure occurrences, whether in hardware or in software, into account. Earlier work has shown that conventional runtime fault-tolerant techniques such as periodic checkpointing are not effective for emerging systems. Instead, the ability to predict failure occurrences can help develop more effective checkpointing strategies. Failure prediction has long been regarded as a challenging research problem, mainly due to the lack of realistic failure data from actual production systems. In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days. We have investigated the characteristics of fatal failure events, as w...

Yinglung Liang, Yanyong Zhang, Anand Sivasubramani

Computer Networks | DSN 2006 | Failure Occurrences | Failure Prediction | Realistic Failure Data |

» System log preprocessing to improve failure prediction

» A hybrid financial analysis model for business failure prediction

» Markov-Based Failure Prediction for Human Motion Analysis

» Predicting failures with developer networks and social network analysis

» An Artificial Neural Network based Model to Analyze Malarial Data and Predict Organ Failur...

» Reliability analysis of the fine pitch connection using anisotropic conductive film ACF

» Using Hidden Semi-Markov Models for Effective Online Failure Prediction

» Discovering Rules from Disk Events for Predicting Hard Drive Failures

Post Info

Added	11 Jun 2010
Updated	11 Jun 2010
Type	Conference
Year	2006
Where	DSN
Authors	Yinglung Liang, Yanyong Zhang, Anand Sivasubramaniam, Morris Jette, Ramendra K. Sahoo

Sciweavers
