For each computer system generation, there are always applications or workloads for which the main memory size is the major limitation. On the other hand, in many cases, one could...
Hardware prefetching is a simple and effective technique for hiding cache miss latency and thus improving the overall performance. However, it comes with addition of prefetch buff...
This paper introduces a programming interface called PARRAY (or Parallelizing ARRAYs) that supports system-level succinct programming for heterogeneous parallel systems like GPU c...
Accurately characterizing the resource usage of an application at various levels in the memory hierarchy has been a long-standing research problem. Existing characterization studi...
The cache hierarchy design in existing SMT and superscalar processors is optimized for latency, but not for bandwidth. The size of the L1 data cache did not scale over the past de...