Recent research in embedded computing indicates that packing multiple processor cores on the same die is an effective way of utilizing the ever-increasing number of transistors. The advantage of placing multiple cores into a single die is that it reduces on-chip communication costs (in terms of both execution cycles and power consumption) between the processor cores that are traditionally very high in conventional high-performance parallel architectures (such as SMPs). However, on the negative side, this tighter integration exerts an even higher pressure on off-chip accesses to the memory system. This makes minimizing the number of off-chip accesses a critical optimization goal. This paper discusses a compiler-based solution to this problem for the embedded applications that perform stencil computations. An important characteristic of this solution is that it distinguishes between the intra-processor data reuse and inter-processor data reuse. The first of these captures the data reus...
Guilin Chen, Mahmut T. Kandemir