As microprocessors continue to evolve, many optimizations reach a point of diminishing returns. We introduce HLS, a hybrid processor simulator which uses statistical models and sy...
Value locality, a recently discovered program attribute that describes the likelihood of the recurrence of previously-seen program values, has been studied enthusiastically in the...
Spatial-lattice computations with finite-range interactions are an important class of easily parallelized computations. This class includes many simple and direct algorithms for ...
Communication in cache-coherent distributed shared memory (DSM) often requires invalidating (or writing back) cached copies of a memory block, incurring high overheads. This paper...
Trace caches enable high bandwidth, low latency instruction supply, but have a high miss penalty and relatively large working sets. Consequently, their performance may suffer due ...
Our program benchmarks and simulations of novel circuits indicate that large-window processors are feasible. Using our redesigned superscalar components, a large-window processor ...
Dana S. Henry, Bradley C. Kuszmaul, Gabriel H. Loh...
As DRAM access latencies approach a thousand instructionexecution times and on-chip caches grow to multiple megabytes, it is not clear that conventional cache structures continue ...
This paper evaluates performance characteristics of the Compaq ES40 shared memory multiprocessor. The ES40 system contains up to four Alpha 21264 CPU’s together with a high-perf...
The register file access time is one of the critical delays in current superscalar processors. Its impact on processor performance is likely to increase in future processor genera...