The efforts of an expert to parallelize and optimize a dense linear algebra algorithm for distributed-memory targets are largely mechanical and repetitive. We demonstrate that the...
Bryan Marker, Andy Terrel, Jack Poulson, Don S. Ba...
: Tuning numerical libraries has become more difficult over time, as systems get more sophisticated. In particular, modern multicore machines make the behaviour of algorithms hard ...
Emmanuel Agullo, Jack Dongarra, Rajib Nath, Stanim...
—Much of dense linear algebra has been successfully blocked to concentrate the majority of its time in the Level 3 BLAS, which are not only efficient for serial computation, but...
Dense linear algebra codes are often expressed and coded in terms of BLAS calls. This approach, however, achieves suboptimal performance due to the overheads associated to such cal...
The scalable parallel implementation, targeting SMP and/or multicore architectures, of dense linear algebra libraries is analyzed. Using the LU factorization as a case study, it is...
—We present an application programming interface (API) for the C programming language that facilitates the development of dense linear algebra algorithms on graphics processors a...