Content-based document image retrieval is a new and promising research area. Indexing documents directly from image content, without OCR, is more general and convenient. However, content-based Chinese document retrieval is difficult because of the complex structure of Chinese characters and the large number of character classes. Few papers address this issue, which is the focus of this paper. This paper presents a novel knowledge-based clustering algorithm and a serial batch clustering mechanism for large data sets. The knowledge is derived from an artificial document image collection. High-frequency Chinese characters are edited and synthesized into images automatically. Cluster IDs are used to index the characters. A Dream of Red Mansions, a famous classical Chinese literary work of nearly one million characters, is used to evaluate the performance of Chinese keyword spotting. Experimental results confirm the effectiveness of knowledge-based clustering and its application on Chinese keywo...