Highly Reliable Linux HPC Clusters: Self-Awareness Approach

Abstract. Current fault-tolerance solutions for HPC systems focus on dealing with the result of a failure. However, most cannot handle runtime system configuration changes caused by transient failures and require a complete restart of the entire machine. The recently released HA-OSCAR software stack is one such effort making inroads here. This paper discusses detailed solutions for enhancing the high availability and serviceability of clusters with HA-OSCAR via multi-head-node failover and a service-level fault-tolerance mechanism. Our solution employs self-configuration and introduces Adaptive Self Healing (ASH) techniques. An availability improvement analysis of HA-OSCAR was also conducted with various sensitivity factors. Finally, the paper details the system layering strategy, dependability modeling, and the analysis of an actual experimental system using a Petri net-based model, the Stochastic Reward Net (SRN).

Chokchai Leangsuksun, Tong Liu, Yudan Liu, Stephen L. Scott, Richard Libby, Ibrahim Haddad

Effort Making Inroads | Fault Tolerance Mechanism | ISPA 2004 | Petri Net-based Model |

» Container-based operating system virtualization: a scalable, high-performance alternative to h...

» High Performance Computing for Disease Surveillance

» An Open Source performance tools software suite for scientific computing

» A trace-driven emulation framework to predict scalability of large clusters in presence of ...

» SSWrapper: a package of wrapper applications for similarity searches on Linux clusters

» A High Performance Communication Subsystem for PODOS

» Facilitating inter-application interactions for OS-level virtualization

» Group Communication in Differentiated Services Networks

Post Info

Added	02 Jul 2010
Updated	02 Jul 2010
Type	Conference
Year	2004
Where	ISPA
Authors	Chokchai Leangsuksun, Tong Liu, Yudan Liu, Stephen L. Scott, Richard Libby, Ibrahim Haddad
