Proactive fault tolerance for HPC with Xen virtualization

Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today’s systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from “unhealthy” nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by but not limited to Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen’s live migration mechanism for a guest operating system (OS) to migrate an MP...
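The idea of combining health monitoring with load-based migration can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` fields, the temperature threshold, and the function names are all hypothetical, chosen only to show how an unhealthy node might be paired with the least-loaded healthy one. The sketch only plans migrations; it does not invoke Xen.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    temperature_c: float  # hypothetical health reading (e.g. from a sensor)
    load: float           # hypothetical load metric; lower means less busy


# Hypothetical threshold; the abstract does not specify any value.
TEMP_LIMIT_C = 70.0


def is_healthy(node: Node) -> bool:
    """A node counts as healthy while its reading stays below the limit."""
    return node.temperature_c < TEMP_LIMIT_C


def pick_target(source: Node, nodes: List[Node]) -> Optional[Node]:
    """Return the least-loaded healthy node other than source, or None."""
    candidates = [n for n in nodes if n is not source and is_healthy(n)]
    return min(candidates, key=lambda n: n.load, default=None)


def plan_migrations(nodes: List[Node]) -> List[Tuple[str, str]]:
    """Return (source, target) name pairs, one per unhealthy node."""
    plan = []
    for node in nodes:
        if not is_healthy(node):
            target = pick_target(node, nodes)
            if target is not None:
                plan.append((node.name, target.name))
    return plan
```

In a real deployment, each planned pair would be handed to the hypervisor's live migration facility, and the load of a chosen target would need to be updated before planning the next move.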

Arun Babu Nagarajan, Frank Mueller, Christian Engelmann, Stephen L. Scott

Health Monitoring | ICS 2007 | Live Migration | Proactive Ft | Theoretical Computer Science |

Added	08 Jun 2010
Updated	08 Jun 2010
Type	Conference
Year	2007
Where	ICS
Authors	Arun Babu Nagarajan, Frank Mueller, Christian Engelmann, Stephen L. Scott
