Optimization for performance and energy for batched matrix computations on GPUs



As modern hardware keeps evolving, an increasingly effective approach to develop energy efficient and high-performance solvers is to design them to work on many small size independent problems. Many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs. We describe the development of the main one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the LU and Cholesky factorizations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. The goal of avoiding multicore CPU use, e.g., as in the hybrid CPU-GPU algorithms, is to exclusively benefit from the GPU’s significantly higher energy efficiency, as well as from the removal of the costly CPU-to-GPU communications. Furthermore, we do no...
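To make the idea of "a sequence of batched BLAS routines" concrete, here is a minimal sketch of a blocked Cholesky factorization in which every step is applied to the whole batch of small matrices at once. It is not the authors' GPU implementation: it uses NumPy on the CPU as a stand-in, and the function name `batched_cholesky` and the `block` parameter are hypothetical names chosen for illustration. The three steps in the loop loosely correspond to batched POTRF, TRSM, and SYRK calls.

```python
import numpy as np


def batched_cholesky(a, block=2):
    """Blocked right-looking Cholesky applied to a whole batch at once.

    a: array of shape (batch, n, n); each slice is assumed to be
       symmetric positive definite.
    Returns the lower-triangular factors L with a[i] == L[i] @ L[i].T.
    """
    a = np.array(a, dtype=float)  # work on a copy of the input
    _, n, _ = a.shape
    for k in range(0, n, block):
        e = min(k + block, n)
        # Step 1 ("batched POTRF"): factor the diagonal block of every matrix.
        a[:, k:e, k:e] = np.linalg.cholesky(a[:, k:e, k:e])
        if e < n:
            # Step 2 ("batched TRSM"): L21 = A21 * L11^{-T} for every matrix,
            # computed by solving L11 * X^T = A21^T.
            l11 = a[:, k:e, k:e]
            a21_t = np.swapaxes(a[:, e:, k:e], 1, 2)
            a[:, e:, k:e] = np.swapaxes(np.linalg.solve(l11, a21_t), 1, 2)
            # Step 3 ("batched SYRK"): trailing update A22 -= L21 * L21^T.
            l21 = a[:, e:, k:e]
            a[:, e:, e:] -= l21 @ np.swapaxes(l21, 1, 2)
    # Discard the untouched upper triangle of each matrix.
    return np.tril(a)


# Usage: factor 1000 independent 8x8 SPD matrices in one call.
rng = np.random.default_rng(0)
m = rng.standard_normal((1000, 8, 8))
spd = m @ np.swapaxes(m, 1, 2) + 8 * np.eye(8)
factors = batched_cholesky(spd, block=4)
```

The point of the sketch is only the structure: no step loops over individual matrices, so each one maps naturally onto a single batched kernel launch on a GPU.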



Distributed And Parallel Computing | PPOPP 2015 |


Post Info

Added	16 Apr 2016
Updated	16 Apr 2016
Type	Journal
Year	2015
Where	PPOPP
Authors	Azzam Haidar, Tingxing Dong, Piotr Luszczek, Stanimire Tomov, Jack J. Dongarra
