In this paper, we propose a generative model-based approach to audio-visual event classification. The approach builds on a new unsupervised learning method that uses an extended probabilistic Latent Semantic Analysis (pLSA) model. We represent each video clip as a collection of spatial-temporal-audio words obtained by fusing visual and audio features, and we treat each audio-visual event class as a latent topic in the pLSA model. The probability distributions of the spatial-temporal-audio words over these topics are learned from training examples, a set of videos representing different types of audio-visual events. Experimental results demonstrate the effectiveness of the proposed approach.
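To make the topic-model component concrete, the sketch below shows a plain pLSA fit with EM over a clip-by-word count matrix, where each "document" is a video clip and each "word" is a quantized spatial-temporal-audio descriptor. This is a minimal illustration of standard pLSA, not the extended model proposed in the paper; the NumPy implementation, function names, and the toy data are assumptions introduced here for exposition.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit a plain pLSA model with EM.

    counts : (n_clips, n_words) count matrix; each row is one video clip's
             bag of quantized spatial-temporal-audio words (the quantization
             step is assumed, not specified here).
    Returns P(w|z) of shape (n_topics, n_words) and P(z|d) of shape
    (n_clips, n_topics), where each latent topic z plays the role of an
    audio-visual event class.
    """
    rng = np.random.default_rng(seed)
    n_clips, n_words = counts.shape

    # Random initialization of the two conditional distributions.
    p_w_given_z = rng.random((n_topics, n_words))
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = rng.random((n_clips, n_topics))
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: posterior P(z|d,w) proportional to P(z|d) * P(w|z).
        joint = p_z_given_d[:, :, None] * p_w_given_z[None, :, :]
        denom = joint.sum(axis=1, keepdims=True)
        denom[denom == 0] = 1e-12
        p_z_given_dw = joint / denom                     # (n_clips, n_topics, n_words)

        # M-step: re-estimate P(w|z) and P(z|d) from expected counts n(d,w)*P(z|d,w).
        expected = counts[:, None, :] * p_z_given_dw
        p_w_given_z = expected.sum(axis=0)
        p_w_given_z /= np.maximum(p_w_given_z.sum(axis=1, keepdims=True), 1e-12)
        p_z_given_d = expected.sum(axis=2)
        p_z_given_d /= np.maximum(p_z_given_d.sum(axis=1, keepdims=True), 1e-12)

    return p_w_given_z, p_z_given_d

# Toy usage: 6 clips, a 20-word audio-visual vocabulary, 3 latent event classes.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    toy_counts = rng.integers(0, 5, size=(6, 20)).astype(float)
    p_w_given_z, p_z_given_d = plsa(toy_counts, n_topics=3)
    print("most likely event topic per clip:", p_z_given_d.argmax(axis=1))
```

In this sketch, classification of a new clip would amount to folding its word counts into the learned P(w|z) and picking the topic with the highest posterior; the paper's extension of pLSA for fusing the audio and visual modalities is not reproduced here.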