Dynamic Load Balance for Optimized Message Logging in Fault Tolerant HPC Applications


Download charm.cs.illinois.edu

Computing systems will grow significantly larger in the near future to satisfy the needs of computational scientists in areas like climate modeling, biophysics and cosmology. Supercomputers being installed in the next few years will comprise millions of cores, hundreds of thousands of processor chips and millions of physical components. However, failures are expected to become more prevalent in those machines, to the point where 10% of an Exascale system will be wasted just recovering from failures. Further, with such large numbers of cores, fine-grained and dynamic load balance will become increasingly critical for maintaining good system utilization. This paper addresses both fault tolerance and load balancing by presenting a novel extension of traditional message logging protocols based on team checkpointing. Message logging makes it possible to recover from localized failures by rolling back just the failed processing elements. Since this comes at a high memory overhead fro...
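The abstract is cut off before it describes the protocol, so the following is only a toy sketch of how message logging combined with team checkpointing might look, not the authors' implementation. It assumes that processing elements are grouped into teams, that only messages crossing a team boundary are stored in the sender's memory, and that a failure rolls back the whole team of the failed element. The class `TeamMessageLog` and all of its method names are hypothetical.

```python
from collections import defaultdict


class TeamMessageLog:
    """Toy sender-side message log that stores only cross-team messages.

    Assumption (not stated in the truncated abstract): messages inside a
    team are not logged, and a team restarts together after a failure.
    """

    def __init__(self, team_of):
        # team_of maps a processing-element id to a team id (hypothetical layout).
        self.team_of = dict(team_of)
        # log[sender] holds (receiver, payload) pairs kept in the sender's memory.
        self.log = defaultdict(list)

    def send(self, sender, receiver, payload):
        """Log the message only if it crosses a team boundary; return True if logged."""
        crosses = self.team_of[sender] != self.team_of[receiver]
        if crosses:
            self.log[sender].append((receiver, payload))
        return crosses

    def rollback_set(self, failed_pe):
        """Processing elements that restart from a checkpoint when failed_pe fails."""
        team = self.team_of[failed_pe]
        return sorted(pe for pe, t in self.team_of.items() if t == team)

    def replay_for(self, failed_pe):
        """Logged messages that surviving senders resend to the rolled-back team."""
        restarting = set(self.rollback_set(failed_pe))
        return [
            (sender, receiver, payload)
            for sender in sorted(self.log)
            if sender not in restarting
            for (receiver, payload) in self.log[sender]
            if receiver in restarting
        ]

    def logged_count(self):
        """Total number of messages currently held in sender memory."""
        return sum(len(entries) for entries in self.log.values())
```

In this sketch, larger teams mean fewer logged messages (lower memory overhead) but more processing elements rolled back per failure, which is the kind of trade-off a team-based scheme would have to balance.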



CLUSTER 2011 | Computational Scientists | Distributed And Parallel Computing | Fault Tolerance | Processor Chips |


Added	18 Dec 2011
Updated	18 Dec 2011
Type	Journal
Year	2011
Where	CLUSTER
Authors	Esteban Meneses, Laxmikant V. Kalé, Greg Bronevetsky
