Matrix multiplication is a basic computing operation. Although basic, it is also expensive: the straightforward technique has $O(N^3)$ runtime complexity. More sophisticated solutions such as Strassen's algorithm reduce this complexity to $O(N^{\log_2 7})$; however, the recursive nature of such algorithms places a large burden on memory systems due to temporary storage and the lack of locality in their access patterns. In this paper we propose a scheme for reordering the matrix entries stored in memory. This reordering provides two major benefits: a simple method to transform the recursive algorithm into an iterative one, and a simple method for maintaining memory locality over the entire operation. Both features provide an improvement in performance that grows as the problem size increases. The proposed reordering scheme has been implemented in C. Testing of our C implementation, which eliminates the need for unnecessary storage of matrix elements from previous iter...
Hossam A. ElGindy, George Ferizis
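
The abstract does not say which reordering is used. As an illustration only, one well-known layout with the two properties it describes is Morton (Z-order) indexing, in which every quadrant of a power-of-two matrix occupies a contiguous range of memory. The sketch below assumes that layout; it is not necessarily the authors' scheme, and the names spread_bits, morton_index and to_morton are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Spread the low 16 bits of x so that bit i moves to bit 2*i. */
static uint32_t spread_bits(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Morton (Z-order) index: row bits go to odd positions, column bits to even.
 * Assumed layout for illustration, not taken from the paper. */
static uint32_t morton_index(uint32_t row, uint32_t col)
{
    return (spread_bits(row) << 1) | spread_bits(col);
}

/* Copy an n x n row-major matrix into Morton order.
 * n must be a power of two and at most 2^15 here, so that every index
 * fits in 32 bits and the nested loops below terminate. */
static void to_morton(const double *src, double *dst, uint32_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0 && n <= 32768u);
    for (uint32_t r = 0; r < n; ++r)
        for (uint32_t c = 0; c < n; ++c)
            dst[morton_index(r, c)] = src[(size_t)r * n + c];
}
```

Under this assumed layout, the four quadrants of an n x n block that starts at offset b begin at b, b + n²/4, b + n²/2 and b + 3n²/4. The sub-blocks that Strassen's recursion needs can therefore be addressed by simple offset arithmetic in a loop, and each sub-block is read from one contiguous region.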