Decision tree-based context clustering is an essential but time-consuming part of building HMM-based speech synthesis systems. The widely used implementation of this technique is not designed to take advantage of highly parallel architectures, such as GPUs. This paper describes an implementation of decision tree-based context clustering designed for such architectures. Experimental results showed that the new implementation running on GPUs was significantly faster than the conventional one running on CPUs.