This paper explains why parallel implementation of matrix multiplication--a seemingly simple algorithm that can be expressed as one statement and three nested loops--is complex: Practical algorithms that use matrix multiplication tend to use matrices of disparate shapes, and the shape of the matrices can significantly impact the performance of matrix multiplication. We provide a class of algorithms that covers the spectrum of shapes encountered and demonstrate that good performance can be attained if the right algorithm is chosen. While the paper resolves a number of issues, it concludes with discussion of a number of directions yet to be pursued.
John A. Gunnels, Calvin Lin, Greg Morrow, Robert A