The locality of the data in parallel programs is known to have a strong impact on the performance of distributed-memory multiprocessor systems. The worse the locality of the access pattern, the worse the performance of single-threaded multiprocessor systems. The main reason is that lower locality increases the latency of network messages, so a processor waiting for these messages idles for long periods. A good data-partitioning strategy strives to improve the locality of accesses by reducing data sharing and network traffic. A certain amount of data sharing, however, is unavoidable in any non-trivial parallel program. So, to tune the performance of multiprocessor systems, compilers and programmers expend significant effort to improve the data partitioning. The technique of multithreading has been promoted as an effective mechanism to hide inter-processor communication and remote data access latencies by quickly switching among a set of ready threads. In this paper, we show that multithreading...
Xinmin Tian, Shashank S. Nemawarkar, Guang R. Gao,