Optimizing the performance of shared-memory NUMA programs remains something of a black art, requiring application writers to possess a deep understanding of their programs’ behavior. This difficulty is one of the remaining hindrances to the widespread adoption and deployment of cost-efficient and scalable shared-memory NUMA architectures. To address this problem, we have developed a performance monitoring infrastructure and a corresponding set of tools that aid in visualizing and understanding the subtleties of the memory access behavior of parallel NUMA applications with large datasets. The tools are designed to be general, interoperable, and easily portable. We give detailed examples of the use of one particular tool in the set. We have used this memory access visualization tool profitably on a range of applications, improving performance by around 90% on average.
Tao Mu, Jie Tao, Martin Schulz, Sally A. McKee