Abstract—The unacceptably long execution time of non-rigid registration (NRR) is often a major obstacle to its routine clinical use. Parallel computing is an effective way to accelerate NRR, but developing efficient parallel NRR codes is very challenging. One desirable approach is to map an existing sequential algorithm onto a parallel architecture to gain speedup, rather than to design a new parallel algorithm. Multicore processors and GPUs together provide a cooperative architecture in which the Single Instruction Multiple Data (SIMD) and Single Program Multiple Data (SPMD) programming models can coexist and complement each other. We present a method to parallelize an NRR algorithm on this cooperative architecture. Our approach first separates the sequential algorithm into regular and irregular parts. We then map the regular part onto the GPU following the SIMD paradigm and the irregular part onto the multicore processors in an SPMD fashion. Unlike the approaches that use multicores or GPU alone, our approach leads to...