Despite large caches, main-memory access latencies still cause significant performance losses in many applications. Numerous hardware and software prefetching schemes have been proposed to tolerate these latencies. Software prefetching typically provides better prefetch accuracy than hardware prefetching, but is limited by per-prefetch overheads and the compiler’s limited prefetch scope. Hardware prefetching can be much more effective at hiding level-two cache miss latencies, but generates many useless prefetches and consumes considerable memory bandwidth. In this paper, we propose a cooperative hardware-software prefetching scheme called Guided Region Prefetching (GRP), which uses compiler-generated hints encoded in load instructions to regulate an aggressive hardware prefetching engine. We compare GRP against a sophisticated pure hardware stride prefetcher and a scheduled region prefetching (SRP) engine. SRP and GRP show the best performance, a 23% gain over no prefetching, but SRP incurs 153% extra memory traffic, more...
Zhenlin Wang, Doug Burger, Steven K. Reinhardt, Ka