The number and complexity of distributed applications have exploded, and to date, each has had to create its own method for providing diagnostic tools and performance metrics. Thes...
Artemis is a modular application designed for analyzing and troubleshooting the performance of large clusters running datacenter services. Artemis is composed of four modules: (1)...
Gabriela F. Cretu-Ciocarlie, Mihai Budiu, Moisé...
The console logs generated by an application contain messages that the application developers believed would be useful in debugging or monitoring the application. Despite the ubiq...
Wei Xu, Ling Huang, Armando Fox, David A. Patterso...
We develop a machine-learned similarity metric for Windows failure reports using telemetry data gathered from clients describing the failures. The key feature is a tuned callstack...
Kevin Bartz, Jack W. Stokes, John C. Platt, Ryan K...
Internet routing is mostly based on static information; its dynamicity is limited to reacting to changes in topology. Adaptive performance-based routing decisions would not o...
Ioannis C. Avramopoulos, Jennifer Rexford, Robert ...
Previous work showed that statistical analysis techniques could successfully be used to construct compact signatures of distinct operational problems in Internet server systems. B...
Automated diagnosis of the cause of system failures from monitoring data is an active area of research at the intersection of systems and machine learning. In this p...
Although queueing models have long been used to model the performance of computer systems, they are out of favor with practitioners, because they have a reputation for requiring u...