Abstract—Programmable accelerators (PA) are receiving increased attention in domain-specific architecture designs to provide more general support for customization. In a PA-rich system, computational kernels are compiled into predefined PA templates and dynamically mapped to real PAs at runtime. This imposes a demanding challenge on the compiler side – that is, how to generate high-quality PA mapping code. Another important concern is the communication cost among PAs: if not handled properly at compile time, data transfers among tens or hundreds of accelerators in a PA-rich system will limit the overall performance gain. In this paper we present an efficient PA compilation flow, which is scalable for mapping large computation kernels into PA-rich architectures. Communication overhead is modeled and optimized in the proposed flow to reduce runtime data transfers among accelerators. Experimental results show that for 12 computation-intensive standard benchmarks, the proposed appr...