We present an efficient method for computing mutual information (MI) between images (2D or 3D) on devices compatible with NVIDIA’s ‘compute unified device architecture’ (CUDA). Efficient parallelization of MI is particularly challenging on a ‘graphics processing unit’ (GPU) because the joint and marginal probability mass functions (pmfs) must be computed from histograms with a large number of bins. The data-dependent (unpredictable) nature of the histogram updates, together with hardware limitations of the GPU (lack of synchronization primitives and limited memory caching mechanisms), can make GPU-based computation inefficient. To overcome these limitations, we approximate the pmfs using a down-sampled version of the joint histogram, which avoids the memory update problems. Our CUDA implementation improves the efficiency of MI calculations by a factor of 25 compared to a standard CPU-based implementation and can be used in MI-based image registration applications.
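For readers unfamiliar with histogram-based MI, the following is a minimal sequential sketch in Python of the quantity being computed. It is not the CUDA implementation described above. The function name `mutual_information`, the parameters `num_bins` and `max_val`, and the use of coarser bins to shrink the joint histogram are assumptions made for illustration; the authors' down-sampling scheme may differ.

```python
import math


def mutual_information(img_a, img_b, num_bins=32, max_val=256):
    """Return MI in bits between two flattened images of equal length.

    Intensities are assumed to be integers in [0, max_val).
    num_bins controls how coarse the joint histogram is (illustrative only).
    """
    if len(img_a) != len(img_b) or len(img_a) == 0:
        raise ValueError("images must be non-empty and of equal size")

    n = len(img_a)

    # Joint histogram: each voxel pair increments one data-dependent bin.
    # On a GPU, many threads may hit the same bin at once, which is the
    # update problem the abstract refers to.
    joint = [[0] * num_bins for _ in range(num_bins)]
    for a, b in zip(img_a, img_b):
        i = a * num_bins // max_val
        j = b * num_bins // max_val
        joint[i][j] += 1

    # Marginal pmfs are the row and column sums of the joint pmf.
    p_a = [sum(row) / n for row in joint]
    p_b = [sum(joint[i][j] for i in range(num_bins)) / n
           for j in range(num_bins)]

    # MI = sum over non-empty bins of p(a,b) * log2(p(a,b) / (p(a) p(b))).
    mi = 0.0
    for i in range(num_bins):
        for j in range(num_bins):
            if joint[i][j] > 0:
                p_ab = joint[i][j] / n
                mi += p_ab * math.log2(p_ab / (p_a[i] * p_b[j]))
    return mi
```

As a sanity check, an image compared with itself gives MI equal to its entropy: `mutual_information([0, 64, 128, 192] * 4, [0, 64, 128, 192] * 4, num_bins=4)` returns 2.0 bits, while two statistically independent images give 0.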