The performance of SIMD processors is often limited by the time it takes to transfer data between the centralized control unit and the parallel processor array. This is especially true of hybrid SIMD models, such as associative computing, that make extensive use of global search operations. Pipelining instruction broadcast can help, but is not enough to solve the problem, especially for massively parallel processors with thousands of processing elements. In this paper, we describe a SIMD processor architecture that combines a fully pipelined broadcast/reduction network with hardware multithreading to reduce performance degradation as the number of processors is scaled up.
Kevin Schaffer, Robert A. Walker