Optimization of Collective Reduction Operations

15 years 12 months ago

Download cs.anu.edu.au

A 5-year-proﬁling in production mode at the University of Stuttgart has shown that more than 40% of the execution time of Message Passing Interface (MPI) routines is spent in the collective communication routines MPI Allreduce and MPI Reduce. Although MPI implementations are now available for about 10 years and all vendors are committed to this Message Passing Interface standard, the vendors’ and publicly available reduction algorithms could be accelerated with new algorithms by a factor between 3 (IBM, sum) and 100 (Cray T3E, maxloc) for long vectors. This paper presents ﬁve algorithms optimized for different choices of vector size and number of processes. The focus is on bandwidth dominated protocols for power-of-two and non-power-of-two number of processes, optimizing the load balance in communication and computation. Keywords. Message Passing, MPI, Collective Operations, Reduction.

Rolf Rabenseifner

Real-time Traffic