In robotics, human-robot interaction has recently received considerable attention. In this paper, we describe a multi-modal system for generating a map of the environment through interaction between a human and a home robot. The system enables people to teach a newly introduced robot the attributes of objects and places in a room through speech commands and hand gestures. The robot learns the size, position, and topological relations of objects, and produces a map of the room from the knowledge acquired through this communication. The system consists of several modules: natural language processing, posture recognition, object localization, and map generation. It combines multiple sources of information with model matching to detect and track a human hand, so that the user can point at an object of interest and direct the robot either to approach it or to register the object's position in the room. The positions of objects in the room...
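To make the pointing-based selection step concrete, the following is a minimal sketch of one common way to resolve a pointing gesture to a known object: given an estimated hand position and pointing direction, select the object whose center lies closest in angle to the pointing ray. This is only an illustration under assumed inputs, not the paper's implementation; the function name, the 15-degree threshold, and the example coordinates are all hypothetical.

```python
import numpy as np

def resolve_pointed_object(hand_pos, point_dir, objects, max_angle_deg=15.0):
    """Pick the object whose center lies closest to the pointing ray.

    hand_pos   -- 3-vector, estimated hand position
    point_dir  -- 3-vector, estimated pointing direction (need not be unit length)
    objects    -- dict mapping object name to a 3-vector center position
    Returns the name of the selected object, or None if no object lies
    within max_angle_deg of the ray.
    """
    d = np.asarray(point_dir, dtype=float)
    d /= np.linalg.norm(d)
    best_name, best_angle = None, np.radians(max_angle_deg)
    for name, center in objects.items():
        v = np.asarray(center, dtype=float) - hand_pos
        dist = np.linalg.norm(v)
        if dist == 0:
            continue
        # Angle between the pointing ray and the hand-to-object vector.
        angle = np.arccos(np.clip(np.dot(d, v / dist), -1.0, 1.0))
        if angle < best_angle:
            best_name, best_angle = name, angle
    return best_name

# Hypothetical example: the user points roughly toward the table.
objects = {"table": np.array([2.0, 0.5, 0.0]),
           "chair": np.array([0.5, 2.0, 0.0])}
print(resolve_pointed_object(np.array([0.0, 0.0, 1.0]),
                             np.array([1.0, 0.25, -0.5]), objects))  # -> "table"
```

An angular criterion, rather than a raw distance to the ray, keeps the selection insensitive to how far away the object is, which matches how people naturally interpret a pointing gesture.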