Communication Across Fault-Containment Firewalls on the SGI Origin

Scalability and reliability are inseparable in high-performance computing. Fault-isolation through hardware is a popular means of providing reliability. Unfortunately, such isolation also increases communication latencies: typically, one has to drop into and out of the kernel to communicate between failure domains. On the other hand, relaxing fault isolation domains allows efficient communication, but at the risk of failure propagation, and thus reduced reliability. We are concerned with finding a middle ground between these extremes. We first review a few salient aspects of the SGI Origin2000 architecture, mentioning the hardware features germane to efficient communication, and building protection firewalls. Then, we describe a mechanism for risk-free, point-to-point communication between processes on distinct failure domains. Quoting performance numbers, we show that the overheads of crossing domains render this mechanism unattractive for small messages. To address this issue, we describe a ...

Kaushik Ghosh, Allan J. Christie

Efficient Communication | Distributed And Parallel Computing | Failure Domains | HPCA 1998 | Isolation Domains Allows

Post Info

Added	04 Aug 2010
Updated	04 Aug 2010
Type	Conference
Year	1998
Where	HPCA
Authors	Kaushik Ghosh, Allan J. Christie
