Effective similarity search in multi-media time series such as video or audio sequences is important for content-based multi-media retrieval applications. We propose a framework that extracts a sequence of local features from large multi-media time series that reflect the characteristics of the complex structured time series more accurately than global features. In addition, we propose a set of suitable local features that can be derived by our framework. These features are scanned from a time series amplitude-levelwise and are called amplitude-level features. Our experimental evaluation shows that our method models the intuitive similarity of multi-media time series better than existing techniques.