The composition tree of a given function, when it exists, provides a representation of the function revealing all possible disjunctive decompositions, thereby suggesting a realiza...
In this paper we evaluate the performance of high bandwidth caches that employ multiple ports, multiple cycle hit times, on-chip DRAM, and a line buffer to find the organization t...
Superscalar processors currently have the potential to fetch multiple basic blocks per cycle by employing one of several recently proposed instruction fetch mechanisms. However, t...
Current microprocessors aggressively exploit instructionlevel parallelism (ILP) through techniques such as multiple issue, dynamic scheduling, and non-blocking reads. Recent work ...
Parthasarathy Ranganathan, Vijay S. Pai, Hazim Abd...
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, ...
Subbarao Palacharla, Norman P. Jouppi, James E. Sm...
This work provides a systematic study of the impact of communication performance on parallelapplications in a high performance network of workstations. We develop an experimental ...
Richard P. Martin, Amin Vahdat, David E. Culler, T...
The SGI Origin 2000 is a cache-coherent non-uniform memory access (ccNUMA) multiprocessor designed and manufactured by Silicon Graphics, Inc. The Origin system was designed from t...
Prefetching is one approach to reducing the latency of memory operations in modern computer systems. In this paper, we describe the Markov prefetcher. This prefetcher acts as an i...
Improvements in main memory speeds have not kept pace with increasing processor clock frequency and improved exploitation of instruction-level parallelism. Consequently, the gap b...