Global communication costs in future single-chip multiprocessors will increase linearly with distance. In this paper, we revisit the issues of locality and load balance in order to take advantage of this new cost model. We present a technique that simultaneously migrates data and threads based on vectors specifying locality and resource usage. This technique improves performance on applications with distinguishable locality and imbalanced resource usage: it achieved 64% of the ideal reduction in execution time on an application with these traits, while it yielded no improvement on a balanced application with little locality.
K. A. Shaw, William J. Dally