Subword parallelism has accelerated many multimedia applications. Subword permutation instructions have been proposed to efficiently rearrange subwords within or across registers, and bit-level permutation instructions have more recently been proposed because of their importance in cryptography. However, some important algorithms, especially those with many conditional control dependencies such as sorting, have not yet taken advantage of subword-parallel instructions. In this paper, we show how one of the bit permutation instructions, GRP, can be used for fast sorting. In the process, we demonstrate the versatility of this permutation instruction for uses other than bit permutation. Such versatility is an important consideration when adding a new instruction to a general-purpose processor. The results show that our sorting methods achieve a significant speedup even when compared with the fastest sorting algorithms. We also discuss the hardware implementation of the GRP instruction...
Zhijie Shi, Ruby B. Lee
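
As a rough illustration of why a grouping permutation is relevant to sorting, the following sketch models GRP in software and uses the same grouping idea for a radix sort. It rests on assumptions not stated in the abstract: GRP is taken to move the source bits whose control bit is 0 to the left and those whose control bit is 1 to the right, keeping the relative order inside each group. The names grp, group_subwords, and radix_sort are hypothetical helpers, and the sort shown is a plain LSD radix sort built on that grouping, not necessarily the method developed in the paper.

```python
def grp(src: int, ctrl: int, width: int = 8) -> int:
    """Software model of the ASSUMED GRP semantics on a width-bit word.

    Source bits whose control bit is 0 go to the left group, bits whose
    control bit is 1 go to the right group; order within a group is kept.
    """
    left, right = [], []
    for i in range(width - 1, -1, -1):  # walk from MSB to LSB
        bit = (src >> i) & 1
        if (ctrl >> i) & 1:
            right.append(bit)
        else:
            left.append(bit)
    result = 0
    for bit in left + right:  # left group first, then right group
        result = (result << 1) | bit
    return result


def group_subwords(items, flags):
    """Same grouping applied to whole elements instead of single bits."""
    left = [x for x, f in zip(items, flags) if f == 0]
    right = [x for x, f in zip(items, flags) if f == 1]
    return left + right


def radix_sort(keys, key_bits):
    """LSD radix sort: one stable grouping pass per key bit (hypothetical)."""
    for b in range(key_bits):
        flags = [(k >> b) & 1 for k in keys]  # current key bit is the control
        keys = group_subwords(keys, flags)
    return keys


if __name__ == "__main__":
    print(bin(grp(0b10110010, 0b01010101)))
    print(radix_sort([5, 3, 7, 0, 2, 6, 1, 4], key_bits=3))
```

Each pass contains no data-dependent branch on a comparison result: the control bits are taken directly from the keys, which is what makes a grouping primitive attractive for an algorithm otherwise dominated by conditional control flow.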