Sorting is an important component of many applications, and parallel sorting algorithms have been studied extensively over the last three decades. One of the earliest parallel sorting algorithms is bitonic sort, which is represented by a sorting network consisting of multiple butterfly stages. This paper studies bitonic sort on modern parallel machines that are relatively coarse-grained and consist of only a modest number of nodes, so that many data elements must be mapped to each processor. In such a setting, optimizing bitonic sort becomes a question of mapping the data elements to processing nodes (the data layout) such that communication is minimized. We developed a bitonic sort algorithm that minimizes the number of communication steps and optimizes the local computation. Experimental results collected on a 64-node Meiko CS-2 show that the resulting algorithm is faster than previous implementations.
Mihai F. Ionescu
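
For readers unfamiliar with the sorting network mentioned in the abstract, the following is a minimal sequential sketch of the textbook bitonic network. It is not the optimized algorithm the paper develops. The function name `bitonic_sort` and the power-of-two length restriction are assumptions made for illustration.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two using the textbook bitonic network.

    Illustrative sketch only; not the communication-optimized algorithm of the paper.
    """
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")

    k = 2
    while k <= n:              # merge stage: produces sorted runs of length k
        j = k // 2
        while j >= 1:          # butterfly step: compare elements that are j apart
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # Runs alternate direction so that neighbours form bitonic sequences.
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


if __name__ == "__main__":
    print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))
```

Each element at index `i` is compared with the element at index `i ^ j`. If the `n` elements were distributed in contiguous blocks of `n / P` per processor (a simple blocked layout, assumed here only for illustration), every step with `j >= n / P` would pair elements held by different processors, which is why the choice of data layout determines how many steps require communication.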