Data-parallel programs are both growing in importance and increasing in diversity, resulting in specialized processors targeted at specific classes of these programs. This paper ...
Karthikeyan Sankaralingam, Stephen W. Keckler, Wil...
Data prefetching effectively reduces the negative effects of long load latencies on the performance of modern processors. Hardware prefetchers employ hardware structures to predic...
Jamison D. Collins, Suleyman Sair, Brad Calder, De...
Throughput servers using simultaneous multithreaded (SMT) processors are becoming an important paradigm with products such as Sun's Niagara and IBM Power5. Unfortunately, thr...
The cache hierarchy design in existing SMT and superscalar processors is optimized for latency, but not for bandwidth. The size of the L1 data cache did not scale over the past de...
Traditional vector architectures have shown to be very effective for regular codes where the compiler can detect data-level parallelism. However, this SIMD parallelism is also pre...