The poor scalability of existing superscalar processors has been of great concern to the computer engineering community. In particular, the critical-path lengths of many components in existing implementations grow as (n2 ) where n is the fetch width, the issue width, or the window size. This paper describes two scalable processor architectures, Ultrascalar I and Ultrascalar II, and compares their VLSI complexities (gate delays, wire-length delays, and area.) Both processors are implemented by a large collection of ALUs with controllers (together called execution stations) connected together by a network of parallel-prefix tree circuits. A fat-tree network connects an interleaved cache to the execution stations. These networks provide the full functionality of superscalar processors including renaming, out-of-order execution, and speculative execution. The difference between the processors is in the mechanism used to transmit register values from one execution station to another. Both a...
Bradley C. Kuszmaul, Dana S. Henry, Gabriel H. Loh