Learning visually grounded words and syntax for a scene description task

A spoken language generation system has been developed that learns to describe objects in computer-generated visual scenes. The system is trained by a "show-and-tell" procedure in which visual scenes are paired with natural language descriptions. Learning algorithms acquire probabilistic structures which encode the visual semantics of phrase structure, word classes, and individual words. Using these structures, a planning algorithm integrates syntactic, semantic, and contextual constraints to generate natural and unambiguous descriptions of objects in novel scenes. The system generates syntactically well-formed compound adjective noun phrases, as well as relative spatial clauses. The acquired linguistic structures generalize from training data, enabling the production of novel word sequences which were never observed during training. The output of the generation system is synthesized using word-based concatenative synthesis drawing from the original training speech corpus. In eva...
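The idea of combining word-level visual semantics with contextual constraints can be illustrated with a toy sketch. This is not the paper's algorithm: the Gaussian word models, the feature names (`area`, `hue`), the fixed adjective order, the single noun, and the length penalty are all assumptions made for illustration. The sketch picks the adjective phrase that fits the target object better than it fits every other object in the scene, while preferring shorter phrases.

```python
import math
from itertools import combinations

# Hypothetical word models: word -> (visual feature, mean, standard deviation).
# In a learned system these would be estimated from scene/description pairs.
WORD_MODELS = {
    "large": ("area", 0.8, 0.15),
    "small": ("area", 0.2, 0.15),
    "red": ("hue", 0.0, 0.1),
    "green": ("hue", 0.33, 0.1),
}
# Assumed syntactic constraint: size adjectives precede colour adjectives.
ADJECTIVE_ORDER = ["large", "small", "red", "green"]
NOUN = "rectangle"  # assumed: every object in the scene is a rectangle


def log_likelihood(word, obj):
    """Gaussian log-likelihood of the object's feature value under the word model."""
    feature, mean, std = WORD_MODELS[word]
    z = (obj[feature] - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2 * math.pi))


def phrase_score(words, obj):
    """Semantic fit of a whole adjective sequence to one object."""
    return sum(log_likelihood(w, obj) for w in words)


def describe(target, distractors, max_adjectives=2, length_penalty=1.0):
    """Return the phrase with the best contextual margin (needs >= 1 distractor)."""
    best_words, best_value = None, -math.inf
    for n in range(1, max_adjectives + 1):
        # combinations() keeps ADJECTIVE_ORDER, so word order is always valid.
        for words in combinations(ADJECTIVE_ORDER, n):
            target_fit = phrase_score(words, target)
            rival_fit = max(phrase_score(words, d) for d in distractors)
            value = (target_fit - rival_fit) - length_penalty * n
            if value > best_value:
                best_words, best_value = words, value
    return " ".join(best_words + (NOUN,))


if __name__ == "__main__":
    large_red = {"area": 0.8, "hue": 0.0}
    small_red = {"area": 0.2, "hue": 0.0}
    large_green = {"area": 0.8, "hue": 0.33}
    print(describe(large_red, [small_red, large_green]))
    print(describe(large_red, [small_red]))
```

With both distractors present, neither "large" nor "red" alone separates the target, so the two-word phrase wins; with only the small red distractor, "large" alone is enough and the length penalty favours it.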

Deb K. Roy

Automated Reasoning | CSL 2002 | Spoken Language | Spoken Language Generation | Visual Scenes

» Language Label Learning for Visual Concepts Discovered from Video Sequences

» Embodied Active Vision in Language Learning and Grounding

» Learning TRECVID08 HighLevel Features from YouTube

Post Info

Added	18 Dec 2010
Updated	18 Dec 2010
Type	Journal
Year	2002
Where	CSL
Authors	Deb K. Roy
