This paper presents a prototype dialogue system, K2, in which a user can instruct agents through speech input to manipulate various objects in a 3-D virtual world. The agents' actions are presented to the user as an animation. Building such a system requires addressing some of the deeper issues of natural language processing, such as ellipsis and anaphora resolution and the handling of vagueness. In this paper, we focus on three distinctive features of the K2 system: handling ill-formed speech input, plan-based anaphora resolution, and handling vagueness in spatial expressions. After an overview of the system architecture, each of these features is described. We also discuss the future research agenda for this system.