In this paper we consider several hardware implementations of the general-purpose atomic primitives fetch and Φ, compare and swap, load linked, and store conditional on large-scal...
As processor speeds increase relative to memory speeds, memory bandwidth is rapidly becoming the limiting performance factor for many applications. Several approaches to bridging ...
Reducing communication latency, which is a performance bottleneck in optically interconnected multiprocessor systems, is of primary importance. A conventional approach for est...
Shared memory is an appealing abstraction for parallel programming. It must be implemented with caches in order to perform well, however, and caches require a coherence mechanism t...
Information on the behavior of programs is essential for deciding the number and nature of functional units in high-performance architectures. In this paper, we present studies on...
Lizy Kurian John, Vinod Reddy, Paul T. Hulina, Lee...
The throughput of a multiple-pipelined processor suffers from a lack of sufficient instructions to keep multiple pipelines busy and from delays associated with pipeline depende...