Sciweavers

HIPC
2009
Springer

Fast checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on multicore architecture

Large-scale compute clusters continue to grow. As clusters and applications grow, however, the Mean Time Between Failures (MTBF) has dropped from days to hours, which makes fault tolerance within the cluster imperative. MPI, the de facto standard for parallel programming, is widely used on such large clusters. Many MPI implementations provide Checkpoint/Restart schemes built on the Berkeley Lab Checkpoint/Restart (BLCR) library to achieve some level of fault tolerance. However, the performance of the Checkpoint/Restart mechanism does not scale well with increasing job size, which limits its deployment for large-scale parallel applications. In our previous work, we proposed a technique to aggregate certain categories of checkpoint writes to reduce the checkpointing overhead. However, an application still experiences slow checkpoint writing because it is blocked waiting for its checkpoint file...
Added 18 Feb 2011
Updated 18 Feb 2011
Type Journal
Year 2009
Where HIPC
Authors Xiangyong Ouyang, Karthik Gopalakrishnan, Tejus Gangadharappa, Dhabaleswar K. Panda