This paper presents high-performance collective communication algorithms and implementations that exploit the unique architectural features of the Cell heterogeneous multicore processor. It describes novel algorithms for the barrier, broadcast, reduce, all-reduce, and all-gather collective operations, and demonstrates their efficiency by comparing them to the fastest previously known implementations of these operations for the Cell. The new implementations are faster than the published state-of-the-art, achieving up to 19.21 times the performance (a 95% reduction in latency) of previously published collective communication work for the Cell [19, 25]. The results cover performance both within a single chip and across the two Cell chips on a Cell blade [10].
Qasim Ali, Samuel P. Midkiff, Vijay S. Pai