In this paper, we take a probabilistic approach to the problem of keyword query cleaning for structured databases. Keyword query cleaning consists of rewriting the user query, segmenting the keywords, matching each segment to database items, and finally tagging the segments with their metadata. We present an efficient and robust solution based on Hidden Markov Models (HMMs). Modeling user keyword queries with a generative probabilistic model, we construct an HMM from the user-specified keyword query and the database instance. The statistically optimal cleaning of the query is then computed as the most likely path through the constructed HMM. Furthermore, we show how this optimal HMM-based cleaning algorithm can be generalized to compute a stream of clean queries, ranked from most likely to least likely. Finally, we present an implementation of the proposed system and report its preliminary performance.

Categories and Subject Descriptor...
Ken Q. Pu
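
To make the "most likely path" idea concrete, the following is a minimal sketch of the standard Viterbi algorithm applied to a toy tagging task. It is not the model constructed in the paper: the states (`author`, `title`), all probabilities, the smoothing constant `FLOOR`, and the function name `viterbi` are hypothetical, and the sketch covers only the tagging step, not query rewriting or segmentation.

```python
import math

FLOOR = 1e-12  # assumed smoothing probability for keywords a state never emits

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log_prob, path) for the most likely hidden state sequence."""
    # best[s] = (log-probability of the best path ending in state s, that path)
    best = {
        s: (math.log(start_p[s])
            + math.log(emit_p[s].get(observations[0], FLOOR)), [s])
        for s in states
    }
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Pick the predecessor state that maximizes the path score into s.
            prev = max(states,
                       key=lambda p: best[p][0] + math.log(trans_p[p][s]))
            score = (best[prev][0] + math.log(trans_p[prev][s])
                     + math.log(emit_p[s].get(obs, FLOOR)))
            step[s] = (score, best[prev][1] + [s])
        best = step
    return max(best.values(), key=lambda entry: entry[0])

# Hypothetical toy model: each keyword is tagged with a database attribute.
states = ["author", "title"]
start_p = {"author": 0.5, "title": 0.5}
trans_p = {
    "author": {"author": 0.6, "title": 0.4},
    "title": {"author": 0.2, "title": 0.8},
}
emit_p = {
    "author": {"ken": 0.5, "pu": 0.5},
    "title": {"keyword": 0.4, "query": 0.3, "cleaning": 0.3},
}

query = ["ken", "pu", "query", "cleaning"]
log_prob, tags = viterbi(query, states, start_p, trans_p, emit_p)
print(list(zip(query, tags)))
```

Under these assumed probabilities, the first two keywords are tagged as `author` and the last two as `title`. Working in log space avoids numeric underflow on longer queries.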