Sciweavers

USENIX
1996

Transparent Fault Tolerance for Parallel Applications on Networks of Workstations

This paper describes a new method for providing transparent fault tolerance for parallel applications on a network of workstations. We have designed our method in the context of a shared object system called SAM, a portable run-time system which provides a global name space and automatic caching of shared data. SAM incorporates a novel design intended to address the problem of the high communication overheads in distributed memory environments and is implemented on a variety of distributed memory platforms. Our fundamental approach to providing fault tolerance is to ensure the replication of all data on more than one workstation using the dynamic caching already provided by SAM. The replicated data is accessible to the local processor like other cached data, making access to shared data faster and potentially offsetting some of the fault tolerance overhead. In addition, our method uses information available in SAM applications on how processes access shared data to enable several optimization...
Daniel J. Scales, Monica S. Lam
Added 02 Nov 2010
Updated 02 Nov 2010
Type Conference
Year 1996
Where USENIX
Authors Daniel J. Scales, Monica S. Lam