

CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems

Considerable work has been done on providing fault tolerance capabilities for different software components on large-scale high-end computing systems. Thus far, however, these fault-tolerant components have worked insularly and independently, and information about faults is rarely shared. Such lack of system-wide fault tolerance is emerging as one of the biggest problems on leadership-class systems. In this paper, we propose a coordinated infrastructure, named CIFTS, that enables system software components to share fault information with each other and adapt to faults in a holistic manner. Central to the CIFTS infrastructure is a Fault Tolerance Backplane (FTB) that enables fault notification and awareness throughout the software stack, including fault-aware libraries, middleware, and applications. We present details of the CIFTS infrastructure and the interface specification that has allowed various software programs, including MPICH2, MVAPICH, Open MPI, and PVFS, to plug into the C...
Added 23 May 2010
Updated 23 May 2010
Type Conference
Year 2009
Where ICPP
Authors Rinku Gupta, Pete Beckman, Byung-Hoon Park, Ewing L. Lusk, Paul Hargrove, Al Geist, Dhabaleswar K. Panda, Andrew Lumsdaine, Jack Dongarra