The scalable parallel implementation of dense linear algebra libraries, targeting SMP and/or multicore architectures, is analyzed. Using the LU factorization as a case study, it is shown that an algorithm-by-blocks exposes a higher degree of parallelism than traditional implementations based on multithreaded BLAS. The implementation of this algorithm using the SuperMatrix runtime system is discussed, and the scalability of the solution is demonstrated on two different platforms with 16 processors.

Key words: Dense linear algebra libraries, high-level APIs, run-time system, multithreaded architectures, LU factorization.
Gregorio Quintana-Ortí, Enrique S. Quintana
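
The degree of parallelism that the abstract attributes to an algorithm-by-blocks can be illustrated with a small sketch. The code below is an assumption-laden illustration, not the authors' implementation and not the SuperMatrix API: it assumes an LU factorization without pivoting on a square grid of blocks, and the names `lu_block_tasks` and `schedule_waves` are hypothetical. It lists the block operations and groups them into waves of tasks that have no dependencies on one another and could therefore run concurrently.

```python
from collections import defaultdict


def lu_block_tasks(nb):
    """List the block operations of an unpivoted LU on an nb x nb grid of blocks.

    Each task is (kind, blocks_read, block_written); blocks are (row, col) pairs.
    Assumption: no pivoting, which keeps the dependency pattern simple.
    """
    tasks = []
    for k in range(nb):
        # Factor the diagonal block.
        tasks.append(("LU", [], (k, k)))
        # Triangular solves on the blocks to the right of the diagonal block.
        for j in range(k + 1, nb):
            tasks.append(("TRSM_ROW", [(k, k)], (k, j)))
        # Triangular solves on the blocks below the diagonal block.
        for i in range(k + 1, nb):
            tasks.append(("TRSM_COL", [(k, k)], (i, k)))
        # Update of the trailing submatrix, one block at a time.
        for i in range(k + 1, nb):
            for j in range(k + 1, nb):
                tasks.append(("GEMM", [(i, k), (k, j)], (i, j)))
    return tasks


def schedule_waves(tasks):
    """Group tasks into waves; tasks in the same wave are mutually independent.

    A task is placed one wave after the last writer of any block it reads or
    overwrites. Tracking only writers is enough for this task list, because a
    block that is read here is never written again afterwards.
    """
    last_write_wave = {}
    waves = defaultdict(list)
    for kind, reads, write in tasks:
        deps = [last_write_wave.get(block, -1) for block in reads + [write]]
        wave = max(deps) + 1
        last_write_wave[write] = wave
        waves[wave].append((kind, write))
    return [waves[w] for w in sorted(waves)]


if __name__ == "__main__":
    for index, wave in enumerate(schedule_waves(lu_block_tasks(3))):
        print(index, len(wave), wave)
```

For a 3 x 3 grid of blocks this yields 14 tasks, and the second and third waves each contain four independent tasks. Under the stated assumptions, this is the kind of concurrency a runtime system could exploit by scheduling block operations as their dependencies are satisfied, rather than parallelizing inside each BLAS call.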