In this paper we consider several hardware implementations of the general-purpose atomic primitives fetch and Φ, compare and swap, load linked, and store conditionalon large-scale shared-memory multiprocessors. These primitives have proven popular on small-scale bus-based machines, but have yet to become widely available on large-scale, distributed shared memory machines. We propose several alternative hardware implementations of these primitives, and then analyze the performance of these implementations for various data sharing patterns. Our results indicate that good overall performance can be obtained by implementing compare and swap in the cache controllers, and by providing an additional instruction to load an exclusive copy of a cache line.
Maged M. Michael, Michael L. Scott