Memory models such as SC, TSO, and PC enforce load-load ordering: loads from any single thread must appear to all other threads to occur in program order. Out-of-order execution can violate this ordering. Conventional multi-processors with out-of-order cores detect load-load ordering violations by snooping an age-ordered load queue on cache invalidations or evictions, events that act as proxies for the completion of remote stores. This mechanism becomes less efficient in an SMT processor, where every completing store must search the load queue segments of all other threads. The inefficiency arises because store completions from other threads in the same core are not filtered by the cache and coherence protocol: thread 0 observes all of thread 1's stores, not only the first store to each cache line. SMT-Directory eliminates this overhead by implementing the filtering traditionally provided by the coherence protocol in the cache itself. SMT-Directory adds a per-thread "read"...
A. Hilton, A. Roth
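
As an illustration of the baseline problem described in the abstract, the following is a minimal sketch of load queue snooping in an SMT core, not the authors' implementation. It models only the conventional scheme, in which every locally completing store searches the load queue segments of all other threads. All names (`LoadEntry`, `LoadQueueSegment`, `complete_store`, `demo`) and the 64-byte line size are assumptions made for the example, and the squash policy is deliberately conservative: any executed, not-yet-retired load to the stored line is marked, whereas a real design might squash only when an older load is still pending.

```python
from dataclasses import dataclass, field

LINE_BYTES = 64  # assumed cache-line size (hypothetical, for illustration)


@dataclass
class LoadEntry:
    age: int                 # program-order sequence number within its thread
    addr: int                # byte address the load reads
    executed: bool = False   # True once the load has obtained its value
    squash: bool = False     # set when a possible ordering violation is found


@dataclass
class LoadQueueSegment:
    """Load queue segment holding one thread's in-flight (unretired) loads."""
    entries: list = field(default_factory=list)
    searches: int = 0        # number of associative searches performed

    def snoop(self, line: int) -> None:
        """Search this segment for executed loads to cache line `line`."""
        self.searches += 1
        for entry in self.entries:
            # Conservative policy (an assumption): squash every executed
            # load whose address falls in the line being written.
            if entry.executed and entry.addr // LINE_BYTES == line:
                entry.squash = True


def complete_store(store_thread: int, addr: int, segments: list) -> None:
    """Baseline SMT behaviour: a completing store searches the load queue
    segments of all other threads, whether or not they share the line."""
    line = addr // LINE_BYTES
    for tid, segment in enumerate(segments):
        if tid != store_thread:
            segment.snoop(line)


def demo():
    """Thread 1 completes four stores; only the last touches shared data."""
    thread0 = LoadQueueSegment([
        LoadEntry(age=0, addr=0x100, executed=True),
        LoadEntry(age=1, addr=0x200, executed=True),
    ])
    thread1 = LoadQueueSegment()
    segments = [thread0, thread1]

    for addr in (0x1000, 0x1040, 0x1080, 0x100):
        complete_store(store_thread=1, addr=addr, segments=segments)

    squashed = [e.addr for e in thread0.entries if e.squash]
    return thread0.searches, squashed
```

In `demo()`, thread 0's segment is searched on all four store completions even though only one store touches a line that thread 0 has read. This is the unfiltered search traffic that the abstract identifies as the overhead in SMT processors.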