Conventional processors use a fully-associative store queue (SQ) to implement store-load forwarding. Associative search latency does not scale well to capacities and bandwidths re...
Many of the recently proposed techniques to reduce power consumption in caches introduce an additional level of nondeterminism in cache access latency. Due to this additional late...
Soontae Kim, Narayanan Vijaykrishnan, Mary Jane Ir...
—The contribution of memory latency to execution time continues to increase, and latency hiding mechanisms become ever more important for efficient processor design. While high-...
Large instruction window processors achieve high performance by exposing large amounts of instruction level parallelism. However, accessing large hardware structures typically req...
Haitham Akkary, Ravi Rajwar, Srikanth T. Srinivasa...
We have fabricated a Chip Multiprocessor prototype code-named Merlot to proof our novel speculative multithreading architecture. On Merlot, multiple threads provide wider issue wi...