Rapidly increasing quantities of multimedia and spoken content today demand fast and accurate retrieval approaches for convenient browsing. The spoken documents with wide variety of different acoustic and linguistic conditions make supervised training of well-matched acoustic/language models very difficult. Unsupervised methods using frame-based dynamic time warping (DTW) require no acoustic/language models but with high computation load. Therefore, segment-based DTW was proposed to relieve the computation load at the cost of degraded detection performance. In this paper, we refine the segment-based DTW by allowing deletion of end segments of query to improve detection performance. The search space is also reduced by segment similarity constraints. We also proposed a two-pass framework. The segment-baed DTW is performed in the first pass to locate hypothesized spoken term region and the frame-based DTW for precise rescoring in the second pass. Then the pseudo relevance feedback is ...