As the disparity between processor and main memory performance grows, the number of execution cycles spent waiting for memory accesses to complete also increases. As a result, lat...
Teresa L. Johnson, Matthew C. Merten, Wen-mei W. H...
Improving memory performance at software level is more effective in reducing the rapidly expanding gap between processor and memory performance. Loop transformations (e.g. loop un...
Surendra Byna, Xian-He Sun, William Gropp, Rajeev ...
This paper introduces a new highly optimized architecture for remote memory access (RMA). RMA, using put and get operations, is a one-sided communication function which amongst ot...
The MPI Standard supports derived datatypes, which allow users to describe noncontiguous memory layout and communicate noncontiguous data with a single communication function. Thi...
Surendra Byna, William D. Gropp, Xian-He Sun, Raje...
Given the large communication overheads characteristic of modern parallel machines, optimizations that eliminate, hide or parallelize communication may improve the performance of ...