We propose a new statistical model, named Hierarchical Topic Trajectory Model (HTTM), for acquiring a dynamically changing topic model that represents the relationship between video frames and associated text labels. Model parameter estimation, annotation and retrieval can be executed within a unified framework with a few computation. It is also easy to add new modals such as audio signal and geotags. Preliminary experiments on video annotation task with manually annotated video dataset indicate that our proposed method can improve the annotation accuracy.