A Fault Detection Service for Wide Area Distributed Computations


The potential for faults in distributed computing systems is a significant complicating factor for application developers. While a variety of techniques exist for detecting and correcting faults, implementing these techniques in a particular context can be difficult. Hence, we propose a fault detection service designed to be incorporated, in a modular fashion, into distributed computing systems, tools, or applications. This service uses well-known techniques based on unreliable fault detectors to detect and report component failure, while allowing the user to trade off timeliness of reporting against false-positive rates. We describe the architecture of this service, report on experimental results that quantify its cost and accuracy, and describe its use in two applications: monitoring the status of system components of the GUSTO computational grid testbed, and as part of the NetSolve network-enabled numerical solver.
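The trade-off between timeliness and false positives mentioned in the abstract can be illustrated with a generic timeout-based detector. The sketch below is not the paper's implementation: the class `HeartbeatDetector`, its `timeout` parameter, and the component name `worker-1` are hypothetical names chosen for illustration, and time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatDetector:
    """Generic timeout-based unreliable failure detector (illustrative only)."""

    timeout: float  # seconds of silence tolerated before a component is suspected
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, component: str, now: float) -> None:
        # Record the arrival time of the most recent heartbeat.
        self.last_seen[component] = now

    def suspected(self, now: float) -> List[str]:
        # A component is suspected once it has been silent longer than the timeout.
        # The suspicion may be wrong: the heartbeat could simply be delayed.
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


# Two detectors observe the same (hypothetical) component with different timeouts.
fast = HeartbeatDetector(timeout=2.0)
slow = HeartbeatDetector(timeout=10.0)
for detector in (fast, slow):
    detector.heartbeat("worker-1", now=0.0)

# At t=5 no further heartbeat has arrived, perhaps only because of network delay.
fast_view = fast.suspected(now=5.0)   # short timeout: reports quickly, may be a false positive
slow_view = slow.suspected(now=5.0)   # long timeout: no report yet, slower to detect a real failure

# A late heartbeat arrives at t=6, so the short-timeout suspicion was premature.
fast.heartbeat("worker-1", now=6.0)
fast_after = fast.suspected(now=6.5)
```

A short timeout reports failures sooner but is more likely to suspect a component that is merely slow; a long timeout reduces such false positives at the cost of later reports.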

Paul Stelling, Ian T. Foster, Carl Kesselman, Craig A. Lee, Gregor von Laszewski


Computing Systems | Distributed And Parallel Computing | Distributed Computing | HPDC 1998 | Unreliable Fault Detectors |


» QoS-Aware Discovery of Wide-Area Distributed Services

» High-Speed Wide Area Data Intensive Computing: A Ten Year Retrospective

» PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services

» MetaGrid: A Scalable Framework for Wide-Area Service Deployment and Management

» Wide Area Cluster Monitoring with Ganglia

» WebOS: Operating System Services for Wide Area Applications

» Wide Area Computation

» A Problem-Specific Fault-Tolerance Mechanism for Asynchronous Distributed Systems

Post Info

Added	04 Aug 2010
Updated	04 Aug 2010
Type	Conference
Year	1998
Where	HPDC
Authors	Paul Stelling, Ian T. Foster, Carl Kesselman, Craig A. Lee, Gregor von Laszewski
