Raptor: Integrating Checkpoints and Thread Migration for Cluster Management


Software distributed shared-memory (SDSM) provides the abstraction necessary to run shared-memory applications on cost-effective parallel platforms such as clusters of workstations. However, problems such as cluster component reliability and cluster management, which are not directly related to performance, need to be addressed before SDSM solutions can be widely adopted. This paper presents Raptor, an SDSM cluster management system based on checkpoint/recovery and thread migration. Raptor decouples the runtime system and application data from application threads, allowing efficient load balancing, resource allocation, and rollback recovery. The system has two important features. First, it reduces checkpoint overhead by saving only application-specific data that cannot be recreated at recovery time. Second, by integrating thread migration capability both at runtime and at recovery, it allows the addition or removal of computing resources from a running application while adding little or no...
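The two features named in the abstract can be illustrated with a small sketch. This is not Raptor's implementation or API: the `AppState` class, its fields, and the helpers `checkpoint`, `recover`, and `assign_tasks` are all hypothetical names chosen for illustration, and round-robin placement is an assumed stand-in for whatever migration policy the paper uses. The sketch only shows the general idea of saving just the state that cannot be rebuilt and of keeping work assignment separate from the saved data, so that work can be placed on a different set of nodes after recovery.

```python
import pickle
from dataclasses import dataclass, field


@dataclass
class AppState:
    # Hypothetical application state. `iteration` and `partial_sums` cannot be
    # recreated after a failure; `lookup_table` is derived data that can be rebuilt.
    iteration: int = 0
    partial_sums: list = field(default_factory=list)
    lookup_table: dict = field(default_factory=dict)


def build_lookup_table(n):
    """Recreate the derived data from scratch (assumed to be deterministic)."""
    return {i: i * i for i in range(n)}


def checkpoint(state):
    """Serialize only the fields that cannot be recreated at recovery time."""
    return pickle.dumps(
        {"iteration": state.iteration, "partial_sums": state.partial_sums}
    )


def recover(blob, table_size):
    """Restore the saved fields and rebuild the derived data."""
    saved = pickle.loads(blob)
    return AppState(
        saved["iteration"], saved["partial_sums"], build_lookup_table(table_size)
    )


def assign_tasks(num_tasks, nodes):
    """Place logical tasks round-robin on whichever nodes are currently available."""
    return {task: nodes[task % len(nodes)] for task in range(num_tasks)}


state = AppState(3, [1.5, 2.5], build_lookup_table(4))
blob = checkpoint(state)                       # lookup_table is not saved
restored = recover(blob, 4)                    # lookup_table is rebuilt here
placement = assign_tasks(4, ["node0", "node2"])  # e.g. after a node was removed
```

Because the task-to-node mapping is computed independently of the checkpointed data, the same checkpoint can be restored onto more or fewer nodes than were present when it was taken.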

Hazim Shafi, Evan Speight, John K. Bennett


Cluster Component Reliability | Cluster Management | SRDS 2003 | Thread Migration


Post Info

Added	05 Jul 2010
Updated	05 Jul 2010
Type	Conference
Year	2003
Where	SRDS
Authors	Hazim Shafi, Evan Speight, John K. Bennett

