Abstract. This paper presents a study of performance optimization of dense matrix multiplication on IBM Cyclops-64(C64) chip architecture. Although much has been published on how t...
Ziang Hu, Juan del Cuvillo, Weirong Zhu, Guang R. ...
The architecture for a shared memory CPU is described. The CPU allows for parallelism down to the level of single instructions and is tolerant of memory latency. All executable in...
Lowering power consumption in microprocessors, whether used in portables or not, has now become one of the most critical design concerns. On-chip cache memories tend to occupy dom...
Sangyeun Cho, Wooyoung Jung, Yongchun Kim, Seh-Woo...
In a Chip Multi-Processor (CMP) with private caches, the last level cache is statically partitioned between all the cores. This prevents such CMPs from sharing cache capacity in r...
We propose two novel integration techniques -- bypass and bookkeeping -- in the memory controller to address the cache coherence compatibility issue of a non-shared bus heterogene...