The memory consistency model of a parallel programming language controls the order in which memory operations performed by one thread may be observed by another. Due to performance concerns, language designers have been reluctant to adopt the most natural model, sequential consistency, in which accesses appear to take effect in the order in which they were originally specified. We present evidence for the practicality of sequential consistency by showing that advanced compiler analyses can eliminate most memory fences and enable high-level optimizations. Our analyses eliminated nearly all of the memory fences required by a naive implementation, removing most of the dynamically encountered fences in all but one benchmark. We additionally consider two specific optimizations that sequential consistency can prevent, and show that our most aggressive analysis obtains the same performance as the relaxed model when applied to two linear algebra kernels. We believe these results provide important evidence for the viability of sequential consistency.
Amir Kamil, Jimmy Su, Katherine A. Yelick
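
To make the fence problem concrete, consider the classic publication idiom below, written here as a minimal Java sketch of our own (the class and method names OrderingExample, writer, and reader are illustrative and do not come from the paper). Under sequential consistency, a reader that observes flag == true must also observe data == 42; under a relaxed model, the two plain writes may appear reordered, so a naive sequentially consistent implementation must guard such accesses with memory fences, which is precisely the cost that the compiler analyses aim to eliminate.

    // Minimal sketch, assuming ordinary Java threads; names are illustrative.
    class OrderingExample {
        static int data = 0;
        static boolean flag = false;  // plain (non-volatile) fields on purpose

        static void writer() {
            data = 42;     // (1) publish the payload
            flag = true;   // (2) signal that the payload is ready
        }

        static void reader() {
            if (flag) {
                // Under a relaxed model, (2) may be observed before (1),
                // so this can print 0; sequential consistency forbids that.
                System.out.println(data);
            }
        }

        public static void main(String[] args) throws InterruptedException {
            Thread w = new Thread(OrderingExample::writer);
            Thread r = new Thread(OrderingExample::reader);
            w.start(); r.start();
            w.join(); r.join();
        }
    }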