Abstract. This paper presents an application-level non-blocking multicast scheme for dynamic DAG scheduling on large-scale distributedmemory systems. The multicast scheme takes into account both network topology and space requirement of routing tables to achieve scalability. Specifically, we prove that the scheme is deadlock-free and takes at most logN steps to complete. The routing table chooses appropriate neighbors to store based on topology IDs and has a small space of O(logN). Although built upon MPI point-to-point operations, the experimental results show that our scheme is significantly better than the simple flattree method and is comparable to vendor’s collective MPI operations.