Efficient multimodal fusion is a key feature of future video indexing systems. Hidden Markov Models provide a powerful framework for video structure analysis, but they require all video modalities to be strictly synchronous. Taking the analysis of tennis broadcasts as a case study, we introduce Segment Models into video indexing. Segment Models are a generalization of Hidden Markov Models in which the fusion of different modalities can be performed with relaxed synchrony constraints. In our experiments, Segment Models performed marginally better than Hidden Markov Models.