Traditional directory-based cache coherence protocols suffer from long-latency cache misses as a consequence of the indirection introduced by the home node, which must be accessed...
This paper explores the correlation of instruction counts and cache misses to runtime performance for a large family of divide and conquer algorithms to compute the Walsh–Hadama...
— This paper demonstrates that the algorithmic performance of end user programs may be greatly affected by the two or three level caching scheme of the processor, and we introduc...
Efficient use of the memory hierarchy is critical for achieving high performance in a multiprocessor systemon-chip. An external memory that is shared between processors is a bottl...
Arno Moonen, Marco Bekooij, Rene van den Berg, Jef...
Modern embedded CPU systems rely on a growing number of software features, but this growth increases the memory footprint and increases the need for efficient instruction and data...
The use of large instruction windows coupled with aggressive out-oforder and prefetching capabilities has provided significant improvements in processor performance. In this paper...
We present a cross-layer customization methodology for latency and bandwidth efficient inter-core communication in embedded multiprocessors. The methodology integrates compiler, o...