This paper presents a novel probabilistic approach to fusing multimodal metadata for event-based home photo clustering. Photo events are characterized by the coherence of multiple modalities, including time, content, and camera settings. We incorporate this multimodal metadata into a unified probabilistic framework in which the event is treated as a latent semantic concept and discovered by fitting a generative model with an Expectation-Maximization (EM) algorithm. The approach is general and unsupervised, requiring no training procedure or predefined thresholds. Experimental evaluations on 14k photos taken by 10 amateur photographers demonstrate the effectiveness and efficiency of the proposed framework for browsing and searching personal photo collections.
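To make the latent-event idea concrete, the following is a minimal illustrative sketch, not the paper's actual model: it runs EM on a single modality (photo timestamps, synthetic here) for a 1-D Gaussian mixture, treating the event label as the latent variable. The paper's framework fuses multiple modalities; this reduced example only shows the E-step/M-step mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic photo timestamps (hours) drawn from two hypothetical "events".
t = np.concatenate([rng.normal(10.0, 0.5, 40), rng.normal(15.0, 0.7, 60)])

K = 2
w = np.full(K, 1.0 / K)        # mixing weights P(event)
mu = np.array([9.0, 16.0])     # per-event mean time
var = np.array([1.0, 1.0])     # per-event variance

for _ in range(50):
    # E-step: responsibilities P(event | timestamp) under current parameters.
    dens = np.exp(-(t[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = w * dens
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the soft assignments.
    nk = r.sum(axis=0)
    w = nk / len(t)
    mu = (r * t[:, None]).sum(axis=0) / nk
    var = (r * (t[:, None] - mu) ** 2).sum(axis=0) / nk

# Hard cluster labels: most probable latent event per photo.
labels = r.argmax(axis=1)
```

Because the procedure is unsupervised, no threshold on inter-photo time gaps is needed; the mixture parameters and assignments are inferred jointly from the data.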