—Large scale systems (LSS) contain multiple subsystems that interact across multiple nodes in sometimes unforeseen and complicated ways. As a result, pinpointing the subsystems that are the source of performance degradation for a load test in LSS can be frustrating, and might take several hours or even days. This is due to the large volume of performance counter data collected such as CPU utilization, Disk I/O, memory consumption and network traffic, the limited operational knowledge of analysts about all subsystems of an LSS and the unavailability of up-todate documentation in a LSS. We have developed a methodology that automatically ranks the subsystems according to the deviation of their performance in a load test. Our methodology uses performance counter data of a load test to craft performance signatures for the LSS subsystems. Pair-wise correlations among the performance signatures of subsystems within a load test are compared with the corresponding correlations in a baseline t...
Haroon Malik, Bram Adams, Ahmed E. Hassan