In a previous paper we showed how the FLAME methods and tools make it possible to compute dense linear algebra operations on a multi-GPU platform with reasonable performance and little programming effort. In this paper we generalize the approach to systems with multiple hardware accelerators, and incorporate software implementations of standard cache/memory coherence techniques from computer architecture to improve performance. Our experimental evaluation on an NVIDIA Tesla S870 platform shows a peak performance well over 400 GFLOPS.
Enrique S. Quintana-Ortí, Francisco D. Igua