Emerging multi-core processors can accelerate medical imaging applications by exploiting the parallelism available in their algorithms. We have implemented a mutual-information-based 3D linear registration algorithm on the Cell Broadband Engine™ (CBE) processor. By exploiting the processor's highly parallel architecture and high memory bandwidth, our implementation with two CBE processors can register a pair of 256x256x30 3D images in one second. This implementation is significantly faster than a conventional implementation on a traditional microprocessor, and even faster than a previously reported custom-hardware implementation. In addition to parallelizing the code for multiple cores and organizing the data structures to reduce memory traffic, it is critical to optimize the code for the SIMD pipeline structure. We note that code optimization for the SIMD pipeline alone results in a 4.2x-8.7x acceleration for the computation of small kernels. Further, SIMD optimization alone r...
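
For readers unfamiliar with the similarity measure, the following is a minimal scalar sketch of mutual information estimated from a joint intensity histogram, using the standard definition I(A;B) = Σ p(a,b) log2[p(a,b) / (p(a) p(b))]. It is not the Cell implementation described above, and it contains no SIMD or multi-core optimization. The function name `mutual_information` and the parameters `bins` and `max_value` are illustrative assumptions, not names taken from the paper.

```python
import math
from collections import Counter


def mutual_information(a, b, bins=8, max_value=256):
    """Estimate mutual information (in bits) between two intensity sequences.

    `a` and `b` are flat sequences of voxel intensities in [0, max_value).
    `bins` and `max_value` are illustrative parameters (assumptions).
    """
    if len(a) != len(b):
        raise ValueError("images must contain the same number of voxels")
    if len(a) == 0:
        raise ValueError("images must not be empty")

    width = max_value / bins
    n = len(a)

    # Joint histogram over quantized intensity pairs.
    joint = Counter(
        (min(int(x / width), bins - 1), min(int(y / width), bins - 1))
        for x, y in zip(a, b)
    )

    # Marginal histograms obtained by summing the joint histogram.
    hist_a = Counter()
    hist_b = Counter()
    for (i, j), count in joint.items():
        hist_a[i] += count
        hist_b[j] += count

    # Sum p_ij * log2(p_ij / (p_i * p_j)) over non-empty joint bins.
    mi = 0.0
    for (i, j), count in joint.items():
        mi += (count / n) * math.log2(count * n / (hist_a[i] * hist_b[j]))
    return mi
```

A registration loop would evaluate this measure for each candidate linear transform and keep the transform that maximizes it. The joint-histogram accumulation over all voxels is the kind of small, data-parallel kernel that an optimized implementation would typically target.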