The transition to multithreaded, multi-core designs places greater responsibility on programmers and software for improving performance: thread-level parallelism (TLP) will be increasingly relied upon in addition to instruction-level parallelism (ILP) and increased clock frequency. Deciding where to parallelize code is difficult, especially for large, complex applications or those whose original developers have moved on. Outer loops are relatively easy targets for parallelization, but traditional profilers focus primarily on functions and hot inner loops. To aid programmers' parallelization efforts, we introduce the concept of loop-centric profiling, which provides a hierarchical view of how much time is spent in a loop and in the loops nested within it. This paper introduces two techniques for loop profiling. First, we describe an instrumentation-based approach that gathers highly detailed and accurate information about loop behavior. Second, we present a sampling approa...
Tipp Moseley, Daniel A. Connors, Dirk Grunwald, Ra