Software barriers have been designed and evaluated for barrier synchronization in large-scale shared-memory multiprocessors, under the assumption that all processors reach the synchronization point simultaneously. When relaxing this assumption, we demonstrate that the optimum degree of combining trees is not four as previously thought but increases from four to as much as 128 in a 4K system as the load imbalance increases. The optimum degree calculated using our analytic model yields a performance that is within 7% of the optimum obtained by exhaustive simulation with a range of degrees. We also investigate a dynamic placement barrier where slow processors migrate toward the root of the software combining tree. We show that through dynamic placement the synchronization delay can be reduced by a factor close to the depth of the tree, when sufficient slack is available. By choosing a suitable tree degree and using dynamic placement, software barriers that are scalable to large numbers of proce...
Alexandre E. Eichenberger, Santosh G. Abraham
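
To make the trade-off between tree degree and tree depth in the abstract concrete, the following minimal sketch computes the depth of a combining tree for a given processor count and degree. It assumes a full tree in which every counter, including each leaf counter, combines up to `degree` arrivals, so that the depth is the smallest integer d with degree^d >= P. The function name `tree_depth` and this exact tree layout are illustrative assumptions, not taken from the paper.

```python
def tree_depth(num_procs: int, degree: int) -> int:
    """Return the number of counter levels in a combining tree.

    Assumption (not from the paper): a full tree where each counter
    combines up to `degree` arrivals, so the depth is the smallest
    integer d such that degree**d >= num_procs.
    """
    if num_procs < 1:
        raise ValueError("num_procs must be at least 1")
    if degree < 2:
        raise ValueError("degree must be at least 2")

    depth = 0
    capacity = 1  # number of processors a tree of the current depth can serve
    while capacity < num_procs:
        capacity *= degree
        depth += 1
    return depth


if __name__ == "__main__":
    # Depth for a 4K-processor system at a few candidate degrees.
    for degree in (4, 16, 64, 128):
        print(degree, tree_depth(4096, degree))
```

Under these assumptions, a 4K system has a depth of 6 at degree 4 but only 2 at degree 128. A higher degree shortens the path a late arrival must climb, at the cost of more contention at each counter.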