Cache-based, general purpose CPUs perform at a small fraction of their maximum floating point performance when executing memory-intensive simulations, such as those required for sparse matrix-vector multiplication. This effect is due to the memory bottleneck that is encountered with large arrays that must be stored in dynamic RAM. An FPGA core designed for a target performance that does not unnecessarily exceed the memory imposed bottleneck can be distributed, along with multiple memory interfaces, into a scalable architecture that overcomes the bandwidth limitation of a single interface. Interconnected cores can work together to solve a computing problem and exploit a bandwidth that is the sum of the bandwidth available from all of their connected memory interfaces. This work demonstrates this concept of scalability with two memory interfaces through the use of an available FPGA prototyping platform. It is shown that our reconfigurable approach is scalable as performance roughly doubl...
Russell Tessier, Salma Mirza, J. Blair Perot