The emergence of high-density reconfigurable hardware devices gives scientists and engineers an option to accelerate their numerical computing applications on low-cost but powerful “FPGA-enhanced computers”. In this paper, we introduce our efforts toward improving the computational performance of the Basic Linear Algebra Subprograms (BLAS) with FPGA-specific algorithms and methods. Our study focuses on three BLAS subroutines: floating-point summation, matrix-vector multiplication, and matrix-matrix multiplication. They represent all three levels of BLAS functionality, and their sustained computational performance is bounded either by memory bandwidth or by computation. By proposing a group-alignment-based floating-point summation method and applying this technique to the other subroutines, we significantly improve their sustained computational performance and reduce numerical errors while consuming only moderate FPGA resources. Compared with existing FPGA-based implementations, our design...
Chuan He, Guan Qin, Richard E. Ewing, Wei Zhao
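To make the core idea concrete, the following is a minimal software sketch of group-alignment summation as the abstract describes it: every operand's mantissa is shifted to the group's largest exponent and the sum is accumulated in one wide fixed-point register, as a hardware adder would do. The function name, the `guard_bits` parameter, and the truncation behavior are illustrative assumptions, not the paper's actual design.

```python
import math

def group_align_sum(values, guard_bits=64):
    """Sketch of group-alignment summation: align each mantissa to the
    group's largest binary exponent and accumulate in a single wide
    fixed-point integer (hypothetical parameters, not the paper's design)."""
    nonzero = [v for v in values if v != 0.0]
    if not nonzero:
        return 0.0
    # Common reference exponent for the whole group.
    max_exp = max(math.frexp(v)[1] for v in nonzero)
    acc = 0  # accumulator in units of 2**(max_exp - guard_bits)
    for v in nonzero:
        m, e = math.frexp(v)          # v == m * 2**e, with 0.5 <= |m| < 1
        mi = int(m * (1 << 53))       # exact 53-bit integer mantissa
        shift = e - 53 - (max_exp - guard_bits)
        if shift >= 0:
            acc += mi << shift
        else:
            # Bits shifted past the guard band are truncated toward zero,
            # mimicking a fixed-width hardware alignment shifter.
            acc += mi >> -shift if mi >= 0 else -((-mi) >> -shift)
    return math.ldexp(acc, max_exp - guard_bits)
```

Because all addends share one exponent frame, the accumulation itself is exact integer arithmetic; rounding occurs only in the final conversion back to floating point, which is what reduces the error accumulation of a naive sequential summation.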