This paper presents a new method for detecting and accurately extracting a moving object from a video sequence captured by a hand-held camera. To extract a high-quality moving foreground, previous approaches usually assume that the background is static or undergoes only a planar perspective transformation. Our method, built on robust motion estimation, handles challenging videos in which the background has complex depth and the camera undergoes unknown motion. We propose an appearance and structure consistency constraint in 3D warping to robustly model the background, which greatly improves foreground separation, even at object boundaries. The estimated dense motion field and the bilayer segmentation result are refined iteratively, with continuous and discrete optimization applied alternately. Experimental results on challenging videos demonstrate that our method extracts moving objects with high quality.