Recent studies in signal detection theory suggest that the human responses to the stimuli on a visual display are nondeterministic. People may attend to different locations on the same visual input at the same time. To predict the likelihood of where humans typically focus on a video scene, we propose a new stochastic model of visual attention by introducing a dynamic Bayesian network. Our model simulates and combines the visual saliency response and the cognitive state of a person to estimate the most probable attended regions. Experimental results have demonstrated that our model performs significantly better in predicting human visual attention compared to the previous deterministic model.