To observe, analyze, and control large-scale distributed systems and the applications hosted on them, there is an increasing need to continuously monitor performance attributes of ...
Shicong Meng, Srinivas R. Kashyap, Chitra Venkatra...
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a commo...
LOFAR is the first of a new generation of radio telescopes that combines the signals from many thousands of simple, fixed antennas, rather than from expensive dishes. Its revol...
John W. Romein, P. Chris Broekema, Ellen van Meije...
Frequent failure occurrences are becoming a serious concern to the community of high-end computing, especially when the applications and the underlying systems rapidly grow in ...
In this paper, we present a performance modeling framework based on memory bandwidth contention time and a parameterized communication model to predict the performance of OpenMP, M...