There is a growing need for systems that can automatically monitor and analyze application performance data in order to deliver reliable, sustained performance to applications. However, the steadily growing complexity of high-performance computer systems and applications makes this process difficult. We introduce a statistical data reduction method that guides the selection of system metrics that are both necessary and sufficient to describe observed application behavior, thereby reducing instrumentation perturbation and the volume of data to be managed. To evaluate our strategy, we applied it to one CPU-bound Grid application running on cluster machines and to GridFTP data transfer in a wide-area testbed. A comparative study shows that our strategy produces better results than other techniques. It reduces the number of system metrics to be managed by about 80% while still capturing enough information for performance prediction.
Lingyun Yang, Jennifer M. Schopf, Catalin Dumitres