The 2D Discrete Wavelet Transform (DWT) is a time-consuming kernel in many multimedia applications such as JPEG2000 and MPEG-4. The 2D DWT consists of horizontal filtering along the rows followed by vertical filtering along the columns. The vertical filtering is easy to vectorize (assuming row-major order), but vectorizing the horizontal filtering requires many overhead instructions. In this paper we propose several SIMD architectural enhancements, namely the MAC operation, extended subwords, and the matrix register file technique, to develop high-performance implementations of the 2D DWT on SIMD architectures. The MAC operation performs four 32-bit single-precision floating-point multiplications with accumulation. The matrix register file allows data stored consecutively in memory to be loaded into a column of the register file, where a column consists of the corresponding subwords of different registers. These techniques avoid the need for data rearrangement instructions. In add...
Asadollah Shahbahrami, Ben H. H. Juurlink
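The contrast between vertical and horizontal filtering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Plain Python lists stand in for SIMD registers, the two-tap averaging filter is an assumed placeholder rather than the filters used in the paper, boundary extension is omitted, and all function names (`vec_mac`, `vertical_filter`, `horizontal_filter_via_transpose`, `horizontal_filter_scalar`) are hypothetical. With row-major storage, vertical filtering applies one lane-wise multiply-accumulate per contiguous row, whereas the same vector code can only filter along the rows after the data has been rearranged, modeled here as a transpose.

```python
def vec_mac(acc, coeff, vec):
    """Lane-wise multiply-accumulate: acc[i] + coeff * vec[i] for every lane."""
    return [a + coeff * v for a, v in zip(acc, vec)]


def vertical_filter(image, taps):
    """Filter down the columns, downsampling by 2.

    Each image row is contiguous in row-major order, so it is treated as
    one vector operand and every tap costs a single vector MAC.
    """
    width = len(image[0])
    out = []
    for r in range(0, len(image) - len(taps) + 1, 2):
        acc = [0.0] * width
        for k, c in enumerate(taps):
            acc = vec_mac(acc, c, image[r + k])
        out.append(acc)
    return out


def transpose(image):
    """Stand-in for the data rearrangement overhead (shuffles, unpacks)."""
    return [list(col) for col in zip(*image)]


def horizontal_filter_via_transpose(image, taps):
    """Filter along the rows by reusing the vector code on rearranged data."""
    return transpose(vertical_filter(transpose(image), taps))


def horizontal_filter_scalar(image, taps):
    """Scalar reference: filter along each row, downsampling by 2."""
    out = []
    for row in image:
        out.append([sum(c * row[j + k] for k, c in enumerate(taps))
                    for j in range(0, len(row) - len(taps) + 1, 2)])
    return out


if __name__ == "__main__":
    image = [[1.0, 2.0, 3.0, 4.0],
             [5.0, 6.0, 7.0, 8.0],
             [9.0, 10.0, 11.0, 12.0],
             [13.0, 14.0, 15.0, 16.0]]
    taps = [0.5, 0.5]  # assumed averaging filter, for illustration only
    print(vertical_filter(image, taps))
    print(horizontal_filter_via_transpose(image, taps))
```

In this sketch the two `transpose` calls represent the overhead that the abstract attributes to horizontal filtering. A mechanism that loads consecutive memory elements directly into a register-file column, as the matrix register file is described as doing, would remove the need for that separate rearrangement step.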