Sciweavers

PPOPP
2005
ACM

Fault tolerant high performance computing by a coding approach

As the number of processors in today’s high performance computers continues to grow, the mean time to failure of these computers is becoming significantly shorter than the execution time of many current high performance computing applications. Although today’s architectures are usually robust enough to survive node failures without suffering complete system failure, most of today’s high performance computing applications cannot survive node failures; whenever a node fails, they have to abort and restart from the beginning or from a stable-storage-based checkpoint. This paper explores the use of a floating-point arithmetic coding approach to build fault-survivable high performance computing applications that can adapt to node failures without aborting. Although the use of erasure codes over Galois fields has been explored theoretically in diskless checkpointing, few actual implementations exist. This probably derives from concerns rel...
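To make the idea of floating-point arithmetic coding concrete, here is a minimal Python sketch, assuming the simplest case: one extra checksum process and a single node failure. The names `encode` and `recover` are hypothetical and are not taken from the paper, and plain lists stand in for the per-process data. Because the checksum is a floating-point sum, the rebuilt data generally matches the original only up to round-off.

```python
import math


def encode(chunks):
    """Return the element-wise floating-point sum of all data chunks.

    Each chunk stands in for the local data of one process (assumption:
    all chunks have the same length).
    """
    return [sum(column) for column in zip(*chunks)]


def recover(checksum, survivors):
    """Rebuild the single lost chunk as checksum minus the surviving chunks."""
    if not survivors:
        # Only one data chunk existed, so the checksum is a copy of it.
        return list(checksum)
    return [c - sum(column) for c, column in zip(checksum, zip(*survivors))]


# Three "processes", each holding a small vector of local data.
chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
checksum = encode(chunks)

# Simulate the failure of process 1 and rebuild its data from the rest.
lost_index = 1
survivors = [c for i, c in enumerate(chunks) if i != lost_index]
rebuilt = recover(checksum, survivors)

# Compare with a tolerance, since floating-point recovery is not exact in general.
matches = all(math.isclose(a, b, rel_tol=1e-12)
              for a, b in zip(rebuilt, chunks[lost_index]))
print(rebuilt, matches)
```

The sketch keeps the checksum in memory rather than on stable storage, which is the point of diskless checkpointing; tolerating several simultaneous failures would need several differently weighted checksums, which this example does not attempt.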
Added 26 Jun 2010
Updated 26 Jun 2010
Type Conference
Year 2005
Where PPOPP
Authors Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julien Langou, Thara Angskun, George Bosilca, Jack Dongarra