An analysis is presented of the primary factors influencing the performance of a parallel implementation of the UCLA atmospheric general circulation model (AGCM) on distributed-memory, massively parallel computer systems. Several modifications to the original parallel AGCM code aimed at improving its numerical efficiency, load balance, and single-node code performance are discussed. The impact of these optimization strategies on performance on two state-of-the-art parallel computers, the Intel Paragon and Cray T3D, is presented and analyzed. It is found that implementation of a load-balanced FFT algorithm results in a reduction in overall execution time of approximately 45% compared to the original convolution-based algorithm. Preliminary results of the application of a load-balancing scheme for the Physics part of the AGCM code suggest that additional reductions in execution time of 15-20% can be achieved. Finally, several strategies for improving the single-node performance of...
John Z. Lou, John D. Farrara