We demonstrate that data reordering can substantially improve the performance of fine-grained irregular sharedmemory benchmarks, on both hardware and software shared-memory syste...
This paper evaluates several mechanisms for repairing the return-address stack after branch mispredictions. The return-address stack is a small but important structure for achievi...
Kevin Skadron, Pritpal S. Ahuja, Margaret Martonos...
Most current single-chip processors employ an on-chip instruction cache to improve performance. A miss in this insk-uction cache will cause an external memory reference which must...
—The 2D Discrete Wavelet Transform (DWT) is a time-consuming kernel in many multimedia applications such as JPEG2000 and MPEG-4. The 2D DWT consists of horizontal filtering alon...
Conventional high-level synthesis uses the worst case delay to relate all inputs to all outputs of an operation. This is a very conservative approximation of reality, especially i...