Three-dimensional (3-D) models of outdoor scenes are widely used for object recognition, navigation, mixed reality, and other applications. Because such models are often built manually at high cost, automatic 3-D modeling has been investigated. A 3-D model is usually generated with a stereo method. However, such approaches cannot use several hundred images together for dense depth estimation because it is difficult to accurately calibrate a large number of cameras. In this paper, we propose a 3-D modeling method that first estimates the extrinsic camera parameters of a monocular image sequence captured by a moving video camera and then reconstructs a 3-D model of the scene. By using several hundred input images, the method can accurately acquire a 3-D model of an outdoor scene.