Recent superscalar processors issue four instructions per cycle. These processors are also powered by highly-parallel superscalar cores. The potential performance can only be exploited when fed by high instruction bandwidth. This task is the responsibility of the instruction fetch unit. Accurate branch prediction and low I-cache miss ratios are essential for the efficient operation of the fetch unit. Several studies on cache design and branch prediction address this problem. However, these techniques are not sufficient. Even in the presence of efficient cache designs and branch prediction, the fetch unit must continuously extract multiple, non-sequential instructions from the instruction cache, realign these in the proper order, and supply them to the decoder. This paper explores solutions to this problem and presents several schemes with varying degrees of performance and cost. The most-general scheme, the collapsing buffer, achieves near-perfect performance and consistently aligns instruct...
Thomas M. Conte, Kishore N. Menezes, Patrick M. Mi