We introduce a new register file architecture that provides both row-wise and column-wise access, allowing partitioned instructions to be used for column-wise processing without transposition overhead. This feature can accelerate 2D separable image and video processing algorithms, such as 2D convolution and the 2D discrete cosine transform (DCT), by eliminating the transposition steps.
Yoochang Jung, Stefan G. Berg, Donglok Kim, Yongmi
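The idea in the abstract can be illustrated with a small software model. The sketch below is only a conceptual toy, not the hardware design: the class `RegisterFile2D`, the helper `partitioned_filter`, the 3-tap kernel `(1, 2, 1)`, and the zero-padded borders are all assumptions introduced here for illustration. It compares a separable 2D filter done with two explicit transpositions against the same filter done by reading and writing the register file column-wise.

```python
# Toy software model of a register file with row-wise and column-wise access.
# All names, the kernel, and the border handling are hypothetical choices made
# for this illustration; they are not taken from the paper.


class RegisterFile2D:
    """N registers of N sub-words each, accessible by row or by column."""

    def __init__(self, rows):
        self.cells = [list(r) for r in rows]

    def read_col(self, j):
        # Column-wise read: sub-word j of every register, as one register.
        return [row[j] for row in self.cells]

    def write_col(self, j, values):
        # Column-wise write: scatter one register into sub-word j of each row.
        for row, v in zip(self.cells, values):
            row[j] = v


def partitioned_filter(regs, taps=(1, 2, 1)):
    """Filter ACROSS registers, lane by lane, as a partitioned (SIMD-style)
    multiply-accumulate would. Borders are zero-padded."""
    lanes = len(regs[0])
    zero = [0] * lanes
    padded = [zero] + [list(r) for r in regs] + [zero]
    return [
        [sum(t * padded[i + d][lane] for d, t in enumerate(taps))
         for lane in range(lanes)]
        for i in range(len(regs))
    ]


def transpose(block):
    return [list(col) for col in zip(*block)]


def separable_with_transpose(block):
    """Conventional flow: filter, transpose, filter, transpose back."""
    vertical = partitioned_filter(block)
    horizontal = partitioned_filter(transpose(vertical))
    return transpose(horizontal)


def separable_with_column_access(block):
    """Same result, with column reads/writes replacing both transpositions."""
    rf = RegisterFile2D(partitioned_filter(block))
    n = len(rf.cells[0])
    cols = [rf.read_col(j) for j in range(n)]
    for j, col in enumerate(partitioned_filter(cols)):
        rf.write_col(j, col)
    return rf.cells


# A unit impulse in the top-left corner of a 4x4 block.
impulse = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
result = separable_with_column_access(impulse)
```

In this model both functions produce the same output. The difference is that the second never materializes a transposed copy of the block, which corresponds to the transposition steps the abstract says are eliminated.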