Processor architectures with tens to hundreds of arithmetic units are emerging to handle media processing applications. These applications, such as image coding, image synthesis, ...
Scott Rixner, William J. Dally, Brucek Khailany, P...
Efficiency of synchronization mechanisms can limit the parallel performance of many shared-memory applications. In addition, the ever increasing performance gap between processor...
Value prediction is a technique that breaks true data dependences by predicting the outcome of an instruction, and executes speculatively its data-dependent instructions based on ...
This paper presents flit-reservation flow control, in which control flits traverse the network in advance of data flits, reserving buffers and channel bandwidth. Flit-reservation ...
Abstract—Sharing patterns in shared-memory multiprocessors are the key to performance: uniprocessor latencytolerating techniques such as out-of-order execution and non-blocking c...
Prefetching offers the potential to improve the performance of linked data structure (LDS) traversals. However, previously proposed prefetching methods only work well when there i...
Magnus Karlsson, Fredrik Dahlgren, Per Stenstr&oum...
This paper describes a new instruction-supply mechanism, called the eXtended Block Cache (XBC). The goal of the XBC is to improve on the Trace Cache (TC) hit rate, while providing...
Clustered microarchitectures are an effective approach to reducing the penalties caused by wire delays inside a chip. Current superscalar processors have in fact a two-cluster mic...
Ramon Canal, Joan-Manuel Parcerisa, Antonio Gonz&a...
The paper presents PowerMANNA - a distributed-memory parallel computer system based on the 64-Bit PowerPC processor MPC620. The PowerMANNA node architecture supports all the sophi...
With increasing chip densities, future microprocessor designs have the opportunity to integrate many of the traditional systemlevel modules onto the same chip as the processor. So...