In text-dependent speaker verification, the speech signals have to be time-aligned. Dynamic time warping (DTW) can be used for this purpose; it performs the alignment by minimizing the Euclidean cepstral distance between the test and the reference utterance. While the cumulative Euclidean cepstral distance obtained from the DTW algorithm could be used directly to discriminate between utterance pairs spoken by the same speaker and pairs spoken by two different speakers, we show that a distance measure learned by an artificial neural network performs significantly better on the same task.
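
To make the baseline concrete, the following is a minimal sketch of how a cumulative Euclidean distance can be read off a DTW alignment. It is an illustration under stated assumptions, not the configuration used in this work: the function name `dtw_distance`, the symmetric step pattern (vertical, horizontal, diagonal), and the absence of path-length normalization or a warping window are all choices made for the example. Inputs are assumed to be sequences of cepstral feature vectors, one row per frame.

```python
import numpy as np


def dtw_distance(test, ref):
    """Cumulative Euclidean distance along the minimum-cost DTW path.

    test, ref: arrays of shape (frames, coefficients), e.g. cepstral vectors.
    Hypothetical helper; step pattern and lack of normalization are assumptions.
    """
    test = np.asarray(test, dtype=float)
    ref = np.asarray(ref, dtype=float)
    n, m = len(test), len(ref)

    # Local cost: Euclidean distance between every test/reference frame pair.
    local = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=2)

    # acc[i, j]: minimal cumulative cost of aligning test[:i] with ref[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j],      # test frame advances alone
                acc[i, j - 1],      # reference frame advances alone
                acc[i - 1, j - 1],  # both advance together
            )
    return acc[n, m]
```

In a verification setting, a score like this would be compared against a threshold to accept or reject the claimed identity; the learned distance measure replaces this score while the time alignment step remains.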