This paper describes a source-to-source compilation tool for optimizing MPI-based parallel applications. The tool automatically applies a “prepushing” transformation that causes MPI programs to send data aggressively, as soon as it is available, thereby increasing communication-computation overlap and improving application performance. We present asphalt transformer, the Open64-based component of our framework, ASPhALT, that is responsible for automatically performing the prepushing transformation. We also present an extensive study of the performance gains obtained from automatically transformed codes. In particular, we demonstrate how different levels of aggregation affect the performance of parallel programs executing various computation kernels on different clusters. Furthermore, we discuss the differences in performance improvement between hand-optimized and automatically optimized codes, as well as the effect of automation on time-to-solution.
Anthony Danalis, Lori L. Pollock, D. Martin Swany
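
The intuition behind prepushing and aggregation can be illustrated with a toy cost model. The sketch below is not taken from the paper or from ASPhALT: the function names, the per-message `latency` term, and the assumptions of perfectly overlappable transfers over a single serialized link are all hypothetical simplifications chosen for illustration. It compares a bulk exchange (compute everything, then send) with a prepushed schedule in which each group of `tiles_per_message` tiles is sent as soon as it has been computed.

```python
def bulk_time(tiles, compute, transfer, latency=0.0):
    """Compute every tile first, then send all data in one message.

    Hypothetical model: no overlap between computation and communication.
    """
    return tiles * compute + latency + tiles * transfer


def prepush_time(tiles, compute, transfer, latency=0.0, tiles_per_message=1):
    """Send each group of tiles as soon as it has been computed.

    Hypothetical model: transfers run fully in the background, but the
    link carries only one message at a time, and every message pays a
    fixed `latency` on top of its payload cost.
    """
    if tiles_per_message < 1:
        raise ValueError("tiles_per_message must be at least 1")
    cpu_clock = 0.0   # time at which the CPU finishes the current group
    link_free = 0.0   # time at which the link finishes its last message
    sent = 0
    while sent < tiles:
        group = min(tiles_per_message, tiles - sent)
        cpu_clock += group * compute
        # A message starts once its data is ready and the link is idle.
        start = max(cpu_clock, link_free)
        link_free = start + latency + group * transfer
        sent += group
    # The program finishes when both computation and the last send are done.
    return max(cpu_clock, link_free)


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        t = prepush_time(8, 1.0, 0.5, latency=0.25, tiles_per_message=k)
        print(f"tiles_per_message={k}: {t}")
    print("bulk:", bulk_time(8, 1.0, 0.5, latency=0.25))
```

Under these assumptions, with 8 tiles, a compute cost of 1.0 per tile, a transfer cost of 0.5 per tile, and zero latency, the model gives 12.0 for the bulk exchange and 8.5 for per-tile prepushing, because only the last transfer remains exposed. Raising `tiles_per_message` reduces the number of messages (and hence the total latency paid) but leaves a larger final transfer that cannot be overlapped, which is the kind of trade-off an aggregation study explores.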