Abstract--As multicore processors are deployed in mainstream computing, the need for software tools to help parallelize programs is increasing dramatically. Data-dependence profili...
The architecture for a shared memory CPU is described. The CPU allows for parallelism down to the level of single instructions and is tolerant of memory latency. All executable in...
This paper introduces a new highly optimized architecture for remote memory access (RMA). RMA, using put and get operations, is a one-sided communication function which amongst ot...
This paper focuses on the Cyclops64 computer architecture and presents an analytical model and performance simulation results for the preloading and loop unrolling approaches to op...
Yanwei Niu, Ziang Hu, Kenneth E. Barner, Guang R. ...
Modern DRAM systems rely on memory controllers that employ out-of-order scheduling to maximize row access locality and bank-level parallelism, which in turn maximizes DRAM bandwid...