Tuning the performance of applications requires understanding the interactions between code and target architecture. This paper describes a performance modeling approach that not only makes accurate predictions about the behavior of an application on a target architecture for different inputs, but also provides guidance for tuning by highlighting the factors that limit performance in each section of a program. We introduce two new performance metrics that estimate the maximum gain expected from tuning different parts of an application, or from increasing the number of machine resources. We show how these metrics helped identify a bottleneck in the ASCI Sweep3D benchmark, where a lack of instruction-level parallelism limited performance. Transforming one frequently executed loop to ameliorate this bottleneck improved performance by 16% on an Itanium2 system.
Gabriel Marin, John M. Mellor-Crummey