We present a new approach to managing failures and evolution in large, complex distributed systems using runtime paths. We use the paths that requests follow as they move through the syst...
Mike Y. Chen, Anthony Accardi, Emre Kiciman, David...
Quantitative models are needed for a variety of management tasks, including (a) identification of critical variables to use for health monitoring, (b) anticipating service level...
Yixin Diao, Frank Eskesen, Steve Froehlich, Joseph...
In production Grids for scientific applications, service and resource failures must be detected and addressed quickly. In this paper, we describe the monitoring infrastructure use...
Ann L. Chervenak, Jennifer M. Schopf, Laura Pearlm...
This paper seeks to understand how network failures affect the availability of service delivery across wide-area networks (WANs) and to evaluate classes of techniques for...
Bharat Chandra, Michael Dahlin, Lei Gao, Amol Naya...
Large scale distributed systems typically have interactions among different services that create an avenue for propagation of a failure from one service to another. The failures ...