We present a computationally efficient, on-line graph structure estimation method for model-based scene interpretation. Different scenes have different hierarchical graphical models composed of places, objects, and parts. In general, estimating such dynamic graph structures is difficult and time-consuming. The key idea is to represent hypothesized graph structures as multi-modal particles instead of a joint particle representation. This Monte Carlo representation makes on-line hierarchical graph structure estimation feasible. The proposed method is supported by a neurobiological inference model. Large-scale experiments in an indoor environment (12 places, 112 3D objects) validate the feasibility of the proposed inference method.