A trace of a workload’s system calls can be obtained with minimal interference, and can be used to drive repeatable experiments to evaluate system configuration alternatives. R...
Scalability to a large number of processes is one of the weaknesses of current MPI implementations. Standard implementations are able to scale to hundreds of nodes, but not beyond th...
Felix Freitag, Jordi Caubet, Montse Farreras, Toni...
Efficient performance tuning of parallel programs is often hard. In this paper we describe an approach that uses a uni-processor execution of a multithreaded program as reference ...
Improving memory performance at the software level is more effective in reducing the rapidly expanding gap between processor and memory performance. Loop transformations (e.g. loop un...
Surendra Byna, Xian-He Sun, William Gropp, Rajeev ...