Realizing scalable cache coherence in the many-core era comes with a whole new set of constraints and opportunities. It is widely believed that multi-hop, unordered on-chip networ...
Linked data structure (LDS) accesses are critical to the performance of many large scale applications. Techniques have been proposed to prefetch such accesses. Unfortunately, many...
Heterogeneous multicores, such as Cell BE processors and GPGPUs, typically do not have caches for their accelerator cores because coherence traffic, cache misses, and latencies fr...
Prior research demonstrates that temporal memory streaming and related address-correlating prefetchers improve performance of commercial server workloads though increased memory l...
Thomas F. Wenisch, Michael Ferdman, Anastasia Aila...
New media and signal processing applications demand ever higher performance while operating within the tight power constraints of mobile devices. A range of hardware implementatio...
Kevin Fan, Manjunath Kudlur, Ganesh S. Dasika, Sco...
Interconnection plays an important role in performance and power of CMP designs using deep sub-micron technology. The network-on-chip (NoCs) has been proposed as a scalable and hi...
Bo Zhao, Jun Yang 0002, Xiuyi Zhou, Yi Xu, Youtao ...
The verification of an execution against memory consistency is known to be NP-hard. This paper proposes a novel fast memory consistency verification method by identifying a new na...
Yunji Chen, Yi Lv, Weiwu Hu, Tianshi Chen, Haihua ...
The benefits of Out of Order (OOO) processing are well known, as is the effectiveness of predicated execution for unpredictable control flow. However, as previous research has dem...
Fine-grained dynamic voltage/frequency scaling (DVFS) is an important tool in managing the balance between power and performance in chip-multiprocessors. Although manufacturing pr...
As the last-level on-chip caches in chip-multiprocessors increase in size, the physical locality of on-chip data becomes important for delivering high performance. The non-uniform...