Motivated by the need for high-throughput sphere decoding in multiple-input multiple-output (MIMO) communication systems, we propose a parallel depth-first sphere decoding (PDSD) algorithm that combines the advantages of parallel processing and rapid search-space reduction. The PDSD algorithm is designed for efficient implementation on programmable multi-processor platforms. We investigate the trade-off between throughput and computation overhead with 2, 4, and 8 processing elements for a 4×4 16-QAM system across a wide range of SNR conditions. Through simulation, we show that, when the number of processing elements is chosen to match the SNR conditions, PDSD offers significant throughput improvement without incurring substantial computation overhead.