How do humans infer probable information from limited observed data? How are they able to build on little knowledge about the context at hand? Does human memory repeatedly construct and reconstruct the events being recalled? These are some of the questions we aim to answer with our multimodal memory game (MMG) platform, which studies human memory and behavior while players watch and remember TV dramas for better recall. Based on preliminary results of human learning obtained from the MMG games, we attempt to show that human memory recall improves steadily with the number of game sessions. As an example case, we compare text-to-text and text-image-to-text learning and demonstrate that adding image context improves learning.