Abstract. We present new performance models and a new, more compact data structure for cache blocking when applied to the sparse matrix-vector multiply (SpM×V) operation, y ← y + A · x. Prior work indicates that cache-blocked SpM×V performs very well for some matrix and machine combinations, yielding speedups as high as 3×. We look at the general question of when and why performance improves, finding that cache blocking is most effective when simultaneously 1) x does not fit in cache, 2) y fits in cache, 3) the non-zeros are distributed throughout the matrix, and 4) the non-zero density is sufficiently high. We extend our prior performance models, which bounded performance by assuming x and y fit in cache, to consider these classes of matrices. Unlike our prior model, the updated models are accurate enough to use as a heuristic for predicting the optimum block sizes. We conclude with architectural suggestions that would make processor and memory systems more amenable to SpM×V...
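To make the cache-blocking idea concrete, the following is a minimal C sketch of a blocked SpM×V kernel. It assumes each cache block is stored in its own CSR-like structure; the type `cache_block_t`, the function `spmv_cache_blocked`, and all field names are illustrative choices for this sketch, not the compact data structure introduced in the paper. The point it shows is that each block touches only a contiguous slice of x, so that slice can stay resident in cache while y, which also fits in cache, is updated in place.

```c
#include <stddef.h>

/* Illustrative cache-blocked SpM x V sketch (not the paper's compact format).
 * The matrix A is split into rectangular cache blocks, each stored in CSR,
 * so that the piece of x touched by a block remains resident in cache. */
typedef struct {
    size_t row_start, col_start;   /* top-left corner of this block in A   */
    size_t nrows;                  /* number of rows in this block         */
    size_t *ptr;                   /* CSR row pointers, length nrows + 1   */
    size_t *ind;                   /* column indices, local to the block   */
    double *val;                   /* non-zero values                      */
} cache_block_t;

/* y <- y + A*x, where A is given as an array of cache blocks. */
void spmv_cache_blocked(const cache_block_t *blocks, size_t num_blocks,
                        const double *x, double *y)
{
    for (size_t b = 0; b < num_blocks; ++b) {
        const cache_block_t *blk = &blocks[b];
        const double *xb = x + blk->col_start;  /* slice of x reused by this block */
        double *yb = y + blk->row_start;
        for (size_t i = 0; i < blk->nrows; ++i) {
            double sum = yb[i];
            for (size_t k = blk->ptr[i]; k < blk->ptr[i + 1]; ++k)
                sum += blk->val[k] * xb[blk->ind[k]];
            yb[i] = sum;
        }
    }
}
```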
Rajesh Nishtala, Richard W. Vuduc, James Demmel, Katherine A. Yelick