In this paper, we propose GeM-Tree, a tree-based multidimensional index structure that indexes both images and videos within a single general framework using the Earth Mover’s Distance. It supports different content-based image and video retrieval approaches and accommodates applications in which the cross-similarity between images and videos needs to be considered during content-based retrieval. Furthermore, it is flexible enough to index different video classification units while maintaining the hierarchical relationships between them. In addition, it uses a construct called the Hierarchical Markov Model Mediator to introduce high-level semantic relationships among images and different levels of video units. The experimental results indicate that GeM-Tree is a promising generalized index structure for multimedia data: it incurs low computational overhead, is flexible enough to support different retrieval approaches, and generates query results with high relevance.