Software distributed-shared-memory (DSM) systems provide an appealing target for parallelizing compilers due to their flexibility. Previous studies demonstrate such systems can provide performance comparable to message-passing compilers for dense-matrix kernels. However, synchronization and load imbalance are significant sources of overhead. In this paper, we investigate the impact of compilation techniques for eliminating barrier synchronization overhead in software DSMs. Our compile-time barrier elimination algorithm extends previous techniques in three ways: 1) we perform inexpensive communication analysis through local subscript analysis when using chunk iteration partitioning for parallel loops, 2) we exploit delayed updates in lazy-release-consistency DSMs to eliminate barriers guarding only anti-dependences, 3) when possible we replace barriers with customized nearest-neighbor synchronization. Experiments on an IBM SP-2 indicate these techniques can improve parallel performance by 20% ...
Hwansoo Han, Chau-Wen Tseng, Peter J. Keleher