The architecture for a shared memory CPU is described. The CPU allows for parallelism down to the level of single instructions and is tolerant of memory latency. All executable in...
A large multi-ported register file is indispensable for exploiting instruction level parallelism (ILP) in today's dynamically scheduled superscalar processors. The number of ...
The cache hierarchy design in existing SMT and superscalar processors is optimized for latency, but not for bandwidth. The size of the L1 data cache did not scale over the past dec...
The increasing complexity of hardware features for recent processors makes high performance code generation very challenging. In particular, several optimization targets have to b...