We propose an organization for the on-chip memory system of a chip multiprocessor, in which 16 processors share a 16MB pool of 256 L2 cache banks. The L2 cache is organized as a n...
Jaehyuk Huh, Changkyu Kim, Hazim Shafi, Lixin Zhan...
Exploiting locality at run-time is a complementary approach to a compiler approach for those applications with dynamic memory access patterns. This paper proposes a memory-layout ...
Efficient intra-node shared memory communication is important for High Performance Computing (HPC), especially with the emergence of multi-core architectures. As clusters continue ...
Snoopy cache coherence protocols broadcast requests to all nodes, reducing the latency of cache to cache transfer misses at the expense of increasing interconnect power. We propos...
Shared memory is an appealing abstraction for parallel programming. It must be implemented with caches in order toperform well, however, and caches require a coherence mechanism t...