— This paper describes a design and implementation of the Smith-Waterman algorithm accelerated on the graphics processing unit (GPU). Our method is implemented using compute unified device architecture (CUDA), which is available on the nVIDIA GPU. The method efficiently uses on-chip shared memory to reduce the data amount being transferred between off-chip memory and processing elements in the GPU. Furthermore, it reduces the number of data fetches by applying a data reuse technique to query and database sequences. We show some experimental results comparing the proposed method with an OpenGL-based method. As a result, the speedup over the OpenGL-based method reaches a factor of 6.4 when using amino acid sequence database. We also find that shared memory reduces the amount of data fetches to 1/140, providing a peak performance of 5.65 giga cell updates per second (GCUPS). This performance is approximately three times faster than a prior CUDA-based implementation.