The efficiency of synchronization mechanisms can limit the parallel performance of many shared-memory applications. In addition, the ever-increasing performance gap between processors and interprocessor communication may further compromise the scalability of these primitives. Ideally, synchronization primitives should provide high performance under both high and low contention without requiring substantial programmer effort or software support. QOLB has been shown to offer substantial speedups and to outperform other synchronization primitives consistently [17], but at the cost of software support and protocol complexity. This paper proposes the use of speculation and delays to implement a purely hardware-based queueing mechanism called Implicit QOLB. Taking advantage of the pervasiveness of the Load-Linked/Store-Conditional primitives, we present a series of hardware mechanisms that optimize performance for the sharing patterns exhibited by locks and their associated data. The mechanisms do not require a...
Ravi Rajwar, Alain Kägi, James R. Goodman