Abstract— In this paper, we propose partly parallel architectures based on optimal overlapped sum-product (OSP) decoding. To ensure high throughput and efficient hardware utilization, the architectures use partly parallel parity checks and pipelined memory access. The impacts of different node update algorithms and quantization schemes are studied. An FPGA implementation of the proposed architectures for a (1536, 768) (3, 6)-regular QC LDPC code achieves an estimated decoding throughput of 61 Mbps at SNR = 4.5 dB. Finally, a noncoherent OSP decoder, which does not always satisfy the data dependency constraints, is proposed to ensure that the maximum throughput gain of 2 offered by OSP decoding is achieved for all QC LDPC codes.
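As a rough illustration of the code parameters quoted above, the following sketch builds the row structure of a (3, 6)-regular QC parity check matrix and checks its dimensions and weights. It assumes the matrix is a 3 x 6 array of 256 x 256 circulant permutation matrices, which is one common way to obtain a (1536, 768) (3, 6)-regular QC LDPC code; the abstract does not state the construction. The shift values in `SHIFTS` and the helper `qc_ldpc_rows` are hypothetical and are not taken from the paper.

```python
def qc_ldpc_rows(shifts, m):
    """Return, for each row of H, the sorted column indices of its ones.

    shifts[j][k] is the cyclic shift of the m x m circulant permutation
    matrix at block row j, block column k (hypothetical values).
    """
    rows = []
    for block_row in shifts:
        for r in range(m):
            # Row r of a circulant with shift s has its single one
            # at column (r + s) mod m inside block column k.
            rows.append(sorted(k * m + (r + s) % m
                               for k, s in enumerate(block_row)))
    return rows


M = 256  # assumed circulant size: 1536 / 6 block columns

# Hypothetical shift values, chosen only for illustration.
SHIFTS = [
    [0, 17, 45, 101, 160, 233],
    [3, 58, 92, 140, 201, 250],
    [7, 29, 77, 123, 188, 211],
]

rows = qc_ldpc_rows(SHIFTS, M)
n_rows = len(rows)           # number of parity checks
n_cols = len(SHIFTS[0]) * M  # code length

# Count how many checks each code bit participates in.
col_weight = [0] * n_cols
for cols in rows:
    for c in cols:
        col_weight[c] += 1

print(n_rows, n_cols)
print({len(cols) for cols in rows}, set(col_weight))
```

Under this assumption every check node connects to six variable nodes and every variable node to three check nodes, whatever shift values are chosen, because each row and column of a circulant permutation matrix contains exactly one nonzero entry.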