Efficient implementations of the Discrete Fourier Transform (DFT) for GPUs provide good performance with large data sizes, but are not competitive with CPU code for small data sizes. On the other hand, several applications perform multiple DFTs on small data sizes. In fact, even algorithms for large data sizes use a divide-and-conquer approach, where eventually small DFTs need to be performed. We discuss our DFT implementation, which is efficient for multiple small DFTs. One feature of our implementation is the use of the asymptotically slow matrix multiplication approach for small data sizes, which improves performance on the GPU due to its regular memory access and computational patterns. We combine this algorithm with the mixed radix algorithm for 1-D, 2-D, and 3-D complex DFTs. We also demonstrate the effect of different optimization techniques. When GPUs are used to accelerate a component of an application running on the host, it is important that decisions taken to optimize th...
S. Mitra, A. Srinivasan
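
The matrix multiplication approach named in the abstract can be illustrated with a minimal CPU-side sketch. This is not the authors' GPU implementation; the function names `dft_matrix` and `batched_dft` are hypothetical, and the code only shows the idea of applying one precomputed DFT matrix to many small signals, which costs O(n^2) per transform but has a fully regular access pattern.

```python
import cmath


def dft_matrix(n):
    """Build the n x n DFT matrix W with W[j][k] = exp(-2*pi*i*j*k/n)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n)]
            for j in range(n)]


def batched_dft(batch):
    """Apply the same DFT matrix to every length-n signal in the batch.

    Each output element is a plain dot product of one matrix row with the
    input signal, so every transform performs identical, predictable work.
    Assumes all signals in the batch share the same length.
    """
    n = len(batch[0])
    w = dft_matrix(n)  # computed once, reused for the whole batch
    return [[sum(w[j][k] * x[k] for k in range(n)) for j in range(n)]
            for x in batch]


if __name__ == "__main__":
    # Two small length-4 signals transformed in one call.
    signals = [[1, 0, 0, 0], [1, 1, 1, 1]]
    for spectrum in batched_dft(signals):
        print([complex(round(v.real, 6), round(v.imag, 6)) for v in spectrum])
```

On a GPU, the same structure would presumably map each dot product (or each signal) to a thread, with the shared matrix kept in fast memory; that mapping is an assumption here, not something the abstract specifies.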