A blossoming paradigm for block-recursive matrix algorithms is presented that, at once, attains excellent performance measured by • time, • TLB misses, • L1 misses, • L2 misses, • paging to disk, • scaling on distributed processors, and • portability to multiple platforms. It provides a philosophy and tools that allow the programmer to deal with the memory hierarchy invisibly, from L1 and L2, to TLB, to paging, and to interprocessor communication. Used together, they provide a cache-oblivious style of programming.

Plots supporting these claims are presented for an implementation of Cholesky factorization crafted directly from the paradigm in C with a few intrinsic calls. The results in this paper focus on low-level performance, including the new Morton-hybrid representation, which takes advantage of hardware and compiler optimizations. In particular, this code beats Intel's Math Kernel Library and matches AMD's Core Math Library, losing a bit on L1 misses while winning decisively...
Michael D. Adams, David S. Wise
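The abstract names the Morton-hybrid representation only at a high level. As a minimal illustrative sketch (not the authors' code), the following C routine maps an element index (i, j) to a linear offset under one such layout: blocks of B x B elements stored row-major internally, with the blocks themselves ordered along a Morton (Z-order) curve. The block edge B, the helper names, and the bit-dilation routine are assumptions chosen for illustration.

/* Minimal sketch of a Morton-hybrid layout: B x B blocks are stored
 * contiguously in row-major order, and the blocks themselves are laid
 * out along a Morton (Z-order) curve.  B and all names are illustrative. */
#include <stdio.h>
#include <stddef.h>

#define B 8                      /* assumed block edge; tuned per platform */

/* Spread the low 16 bits of x into the even bit positions (bit dilation). */
static size_t dilate(size_t x)
{
    x &= 0xFFFF;
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

/* Linear offset of element (i, j) in a Morton-hybrid array. */
static size_t mh_index(size_t i, size_t j)
{
    size_t block  = dilate(i / B) << 1 | dilate(j / B); /* Z-order block id   */
    size_t within = (i % B) * B + (j % B);              /* row-major in block */
    return block * (size_t)(B * B) + within;
}

int main(void)
{
    /* Elements inside one block stay adjacent in memory, while whole
     * blocks follow the Z-order curve across the matrix. */
    printf("offset of (0,0) = %zu\n", mh_index(0, 0));
    printf("offset of (0,7) = %zu\n", mh_index(0, 7));
    printf("offset of (0,8) = %zu\n", mh_index(0, 8));   /* next block */
    printf("offset of (8,8) = %zu\n", mh_index(8, 8));
    return 0;
}

Keeping the innermost B x B blocks row-major lets the compiler and hardware optimize the dense base case, while the Morton ordering of the blocks supplies the locality that a block-recursive algorithm, such as the Cholesky factorization above, exploits at every level of the memory hierarchy without naming any level explicitly.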