In this paper, we propose an architecture for floatingpoint based LU decomposition for large-sized matrices. Our proposed architecture is based on the well known concept of blocking and uses pipelined floating-point units to obtain high throughput. We first analyze the effects of block size and the deeply pipelined floating-point units on the performance of the architecture. We analyze and compare the performance of our double-precision based design with that of a GPP based design. Initial results show that an improvement of upto 23x in the total computation time can be achieved. We then, analyze the impact of algorithm level design (by varying block size) on the system-wide energy dissipation and resource-usage of our designs.
Gokul Govindu, Viktor K. Prasanna, Vikash Daga, Sr