Spatial priors play crucial roles in many high-level vision tasks, e.g. scene understanding. Usually, learning spatial priors relies on training a structured output model. In this paper, two special cases of discriminative structured output model, i.e. Conditional Random Fields (CRFs) and Max-margin Markov Networks (M3 N), are demonstrated to perform image scene understanding. The two models are empirically compared in a fair manner, i.e. using the common feature representation and the same optimization algorithm. Particularly, we adopt online Exponentiated Gradient (EG) algorithm to solve the convex duals of both models. We describe the general procedure of EG algorithm and present a two-stage training procedure to overcome the degeneration of EG when exact inference is intractable. Experiments on a large scale image region annotation task are carried out. The results show that both models yield encouraging results but CRFs slightly outperforms M3 N.