Using checkpointing to recover from poor multi-site parallel job scheduling decisions

Recent research in multi-site parallel job scheduling leverages user-provided estimates of job communication characteristics to effectively partition the job across multiple clusters. Previous research addressed the impact of inaccuracies in these estimates on overall system performance and found that multi-site scheduling techniques benefit from these estimates, even in the presence of considerable inaccuracy. While these results are encouraging, there are many instances where these errors result in poor scheduling decisions that cause network over-subscription. This situation can lead to significantly degraded application runtime performance and turnaround time. In this paper, we explore the use of job checkpointing to selectively stop offending jobs in order to alleviate network congestion and subsequently restart them when (and where) sufficient network resources are available. We then characterize the conditions and the extent to which checkpointing improves overall performan...
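
The stop-and-restart idea in the abstract can be illustrated with a toy model. The sketch below is not the paper's algorithm: the `Job` class, the single shared link with a fixed `capacity`, and the "stop the heaviest job first" policy are all assumptions made for illustration, and setting `running = False` merely stands in for writing a real checkpoint.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    bandwidth: float      # hypothetical inter-cluster bandwidth demand
    running: bool = True


def relieve_congestion(jobs, capacity):
    """Stop the heaviest running jobs until total demand fits the link.

    Assumed policy for illustration only; returns names of stopped jobs.
    """
    running = sorted((j for j in jobs if j.running),
                     key=lambda j: j.bandwidth, reverse=True)
    load = sum(j.bandwidth for j in running)
    stopped = []
    for job in running:
        if load <= capacity:
            break
        job.running = False   # stands in for checkpointing and stopping the job
        load -= job.bandwidth
        stopped.append(job.name)
    return stopped


def restart_when_possible(jobs, capacity):
    """Restart stopped jobs, lightest first, while spare capacity allows."""
    load = sum(j.bandwidth for j in jobs if j.running)
    restarted = []
    for job in sorted((j for j in jobs if not j.running),
                      key=lambda j: j.bandwidth):
        if load + job.bandwidth <= capacity:
            job.running = True
            load += job.bandwidth
            restarted.append(job.name)
    return restarted
```

With three jobs demanding 6, 3, and 2 units on a link of capacity 10, the first function stops only the 6-unit job; that job can be restarted once another job finishes and frees enough of the link.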

William M. Jones

Job Scheduling | Job Scheduling Leverages | MIDDLEWARE 2007 | Multi-site Scheduling Techniques |

Added	08 Jun 2010
Updated	08 Jun 2010
Type	Conference
Year	2007
Where	MIDDLEWARE
Authors	William M. Jones
