Abstract--Clusters featuring the InfiniBand interconnect are continuing to scale. As an example, the "Ranger" system at the Texas Advanced Computing Center (TACC) includes over 60,000 cores with nearly 4,000 InfiniBand ports. The latest Top500 list shows that 30% of systems, and over 50% of the top 100, now use InfiniBand as the compute node interconnect. As these systems continue to scale, the Mean-Time-Between-Failure (MTBF) decreases, and additional resiliency must be provided for the important components of HPC systems, including the MPI library. In this paper we present a design that leverages the reliability semantics of InfiniBand, but provides a higher level of resiliency. We are able to avoid aborting jobs in the case of network failures as well as endpoint failures in the InfiniBand Host Channel Adapters (HCAs). We propose reliability designs for rendezvous protocols using both Remote DMA (RDMA) read and write operations. We implement a prototype of our design an...
Matthew J. Koop, Pavel Shamis, Ishai Rabinovitz, Dhabaleswar K. (DK) Panda