Abstract-- The performance of collective communication operations is known to have a significant impact on the scalability of some applications. Indeed, the global, synchronous nature of some collective operations directly implies that they will become the bottleneck when scaling to hundreds of thousands of nodes. This fact has led many researchers to try to improve the efficiency of collective operations. One popular approach improves the implementation of MPI collective operations by using intelligent or programmable network interfaces to offload the burden of communication activities from the host processor(s). Such implementations have shown significant improvement for microbenchmarks that isolate collective communication performance, but these results have not been shown to translate to significant increases in performance for real applications. In order for collective offload implementations to benefit real applications, a greater understanding of application behavior is needed. ...