Cache optimizations typically consist of code transformations that increase the locality of memory accesses. An orthogonal approach is to hide memory latency through prefetching. With software prefetching, cache load instructions have to be inserted into the program code. To relieve the programmer of this burden, modern processors are equipped with hardware prefetching units, which predict future memory accesses in order to load data into the cache automatically before it is used. For optimal performance, it seems advantageous to combine both prefetching approaches. In this contribution, we first use a cache simulation enhanced with a simple hardware prefetcher to run code for a 3D multigrid solver. Cache misses that are not predicted by the prefetcher can be located in the simulation results, and software prefetch instructions can then be inserted selectively. However, when the performance of a code section is limited by the available bandwidth to main memory, this simple strategy ...