Abstract. Most parallel systems on which MPI is used are now hierarchical: some processors are much closer to others in terms of interconnect performance. One of the most common examples is a system whose nodes are symmetric multiprocessors (including “multicore” processors). A number of papers have developed algorithms and implementations that exploit shared memory on such nodes to provide optimized collective operations, and these show significant performance benefits compared to implementations that do not exploit the hierarchical structure of the nodes. However, shared memory between processes is often a scarce resource. How necessary is it to use shared memory for collectives in MPI? How much of the performance benefit comes from tailoring the algorithm to the hierarchical topology of the system? In this paper, we describe an implementation based entirely on message-passing primitives that nonetheless exploits knowledge of the two-level hierarchy. We discuss both rootless coll...
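To make the two-level idea concrete, the following is a minimal sketch (not the paper's own implementation) of a hierarchy-aware broadcast built only from standard message-passing primitives: processes are first grouped by shared-memory node, one leader per node handles the inter-node phase, and an ordinary intra-node broadcast finishes the job. It assumes MPI-3's MPI_Comm_split_type for discovering node locality; the names node_comm and leader_comm are illustrative.

```c
/* hier_bcast.c -- sketch of a two-level broadcast using only
 * standard MPI calls (no direct shared-memory access).
 * Assumes MPI-3 for MPI_Comm_split_type. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, node_rank;
    MPI_Comm node_comm, leader_comm = MPI_COMM_NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Step 1: group processes that share a node. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                        world_rank, MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Step 2: node-local rank 0 on each node becomes the leader;
     * all other processes pass MPI_UNDEFINED and get no leader_comm. */
    MPI_Comm_split(MPI_COMM_WORLD,
                   node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    /* Step 3: hierarchical broadcast with root = world rank 0:
     * inter-node phase among leaders, then intra-node phase. */
    int value = (world_rank == 0) ? 42 : -1;
    if (leader_comm != MPI_COMM_NULL)
        MPI_Bcast(&value, 1, MPI_INT, 0, leader_comm);
    MPI_Bcast(&value, 1, MPI_INT, 0, node_comm);

    printf("rank %d received %d\n", world_rank, value);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

The same two-phase decomposition applies to other collectives (e.g., reduce locally, then across leaders, then broadcast back), which is the sense in which the algorithm, rather than shared memory itself, is tailored to the topology.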