Multi-core processors based SMP servers have become building blocks for Linux clusters in recent years because they can deliver better performance for multi-threaded programs through on-chip multi-threading. However, a relative slow software barrier can hinder the performance of a data-parallel scientific application on a multi-core system. In this paper we study the performance of different software barrier algorithms on a server based on newly introduced AMD quad-core Opteron processors. We study how the memory architecture and the cache coherence protocol of the system influence the performance of barrier algorithms. We present an optimized barrier algorithm derived from the queue-based barrier algorithm. We find that
Jie Chen, William A. Watson III