—This paper describes a multi-threaded parallel design and implementation of the Smith-Waterman (SM) algorithm on compute unified device architecture (CUDA)-compatible graphic processing units (GPUs). A novel technique has been put forward to solve the restriction on the length of the query sequence in previous GPU implementations of the Smith-Waterman algorithm. The main reasons behind this limitation in previous GPU implementations were the finite size of local memory and number of threads per block. Our solution to this problem uses a divide and conquer approach to compute the alignment matrix involved in each pairwise sequence alignment, as it divides the entire matrix computation into multiple sub-matrices and allocates the available amount of threads and memory resources to each submatrix iteratively. Intermediate data is stored in shared and global memory on the fly depending on the length of sequences in hand. The proposed technique resulted in up to 4.2 GCUPS (Giga Cell Upda...