Co-designing the failure analysis and monitoring of large-scale systems



Large-scale distributed systems provide the backbone for numerous distributed applications and online services. These systems span a multitude of computing nodes at different geographical locations, connected via wide-area networks and overlays. A major concern with such systems is their susceptibility to failures, which lead to service downtime and hence high monetary and business costs. In this paper, we argue that to understand failures in such a system, we need to co-design the monitoring system with the failure analysis system. Unlike existing monitoring systems, which are not designed specifically for failure analysis, we advocate a new way to design a monitoring system with the goal of uncovering causes of failures. Similarly, the failure analysis techniques themselves need to go beyond simple statistical analysis of failure events in isolation to serve as an effective tool. Towards this end, we provide a discussion of some guiding principles for the co-design of moni...

Abhishek Chandra, Rohini Prinja, Sourabh Jain, Zhi-Li Zhang


Failure Analysis | Hardware | Numerous Distributed Applications | SIGMETRICS 2008 | Systems |


» Semantic Routing and Filtering for Large-Scale Video Streams Monitoring

» An analysis of a large scale habitat monitoring application

» Performance Implications of Failures in Large-Scale Cluster Scheduling

» Cluster-Based Failure Detection Service for Large-Scale Ad Hoc Wireless Network Applications

» Monitoring and Debugging Parallel Software with BCS-MPI on Large-Scale Clusters

» Grassroots Approach to Self-management in Large-Scale Distributed Systems

» Mining Console Logs for Large-Scale System Problem Detection

» Prototype of Fault Adaptive Embedded Software for Large-Scale Real-Time Systems

» A Large-Scale Industrial Case Study on Architecture-Based Software Reliability Analysis

Post Info

Added	15 Dec 2010
Updated	15 Dec 2010
Type	Journal
Year	2008
Where	SIGMETRICS
Authors	Abhishek Chandra, Rohini Prinja, Sourabh Jain, Zhi-Li Zhang
