JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management


Download www.csm.ornl.gov

Most of today's HPC systems employ a single head node for control. This node is a single point of failure, because its failure interrupts the entire HPC system, and a single point of control, because the system stays disabled until the node is repaired. One of the most important HPC system services running on the head node is job and resource management. If it goes down, all currently running jobs lose the service they report back to and have to be restarted once the head node is up and running again. In this paper, we present a generic approach for providing symmetric active/active replication for highly available HPC job and resource management. The JOSHUA solution provides a virtually synchronous environment for continuous availability without any interruption of service and without any loss of state. Replication is performed externally via the PBS service interface without the need to modify any service code. Test results as well as availability analysis of our p...
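The idea of symmetric active/active replication described in the abstract can be illustrated with a small sketch. It is not the JOSHUA implementation and does not use the PBS interface: `HeadNode`, `ReplicatedService`, and the `submit`/`delete` commands are hypothetical names, and a single in-process loop stands in for the group communication layer that a real system would need for total ordering and membership. The sketch assumes only that every replica applies the same commands in the same order and behaves deterministically.

```python
from dataclasses import dataclass, field


@dataclass
class HeadNode:
    """One replica of a toy job queue (hypothetical stand-in for a job manager)."""
    name: str
    next_job_id: int = 1
    queue: dict = field(default_factory=dict)
    alive: bool = True

    def apply(self, command):
        """Apply one command deterministically: same input order, same state."""
        op, arg = command
        if op == "submit":
            job_id = self.next_job_id
            self.next_job_id += 1
            self.queue[job_id] = arg
            return job_id
        if op == "delete":
            return self.queue.pop(arg, None) is not None
        raise ValueError(f"unknown operation: {op}")


class ReplicatedService:
    """Delivers every command to all live replicas in one global order."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def execute(self, command):
        # Every live replica processes the command; none is a passive standby.
        results = [n.apply(command) for n in self.nodes if n.alive]
        if not results:
            raise RuntimeError("no head node available")
        # Deterministic replicas fed the same sequence must give the same answer.
        assert all(r == results[0] for r in results)
        return results[0]


def demo():
    a, b = HeadNode("head-a"), HeadNode("head-b")
    service = ReplicatedService([a, b])
    first = service.execute(("submit", "job1.sh"))
    a.alive = False  # simulate a head node failure
    second = service.execute(("submit", "job2.sh"))
    return first, second, b.queue
```

In this toy model, calling `demo()` shows that the surviving replica keeps the full queue and continues numbering jobs after one head node fails, which is the property the abstract describes as continuous availability without loss of state.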

Kai Uhlemann, Christian Engelmann, Stephen L. Scott


CLUSTER 2006 | Cluster Computing | Entire HPC | Head Node | Today's HPC Systems


Post Info

Added	10 Jun 2010
Updated	10 Jun 2010
Type	Conference
Year	2006
Where	CLUSTER
Authors	Kai Uhlemann, Christian Engelmann, Stephen L. Scott
