The performance potential of a value reuse mechanism depends on its reuse detection time, the number of reuse opportunities, and the amount of work saved by skipping each reuse unit. Since larger instruction groups typically have fewer reuse opportunities than smaller groups, but also provide greater benefit for each reuse-detection process, it is very important to find the balance point that provides the largest overall performance gain. We propose a new mechanism called sub-block reuse to balance the reuse granularity and the number of reuse opportunities. Our simulation results show that sub-block reuse with compiler assistance has a substantial and consistent potential to improve the performance of superscalar processors, with speedups ranging from 10% to 22%.
Jian Huang, David J. Lilja