Multimodal grammars provide an expressive formalism for multimodal integration and understanding. However, hand-crafted multimodal grammars can be brittle with respect to unexpected, erroneous, or disfluent inputs. Spoken language (speech-only) understanding systems have addressed the lack of robustness of hand-crafted grammars by exploiting classification techniques to extract the fillers of a frame representation. In this paper, we illustrate the limitations of such classification approaches for multimodal integration and understanding, and present an approach based on edit machines that combines the expressiveness of multimodal grammars with the robustness of the stochastic language models used in speech recognition. We also present an approach in which the edit operations are learned from data using a noisy channel model paradigm. We evaluate and compare the performance of the hand-crafted and learned edit machines in the context of a multimodal conversational system (MATCH).
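For orientation, the noisy channel paradigm mentioned above can be summarized as follows (a minimal sketch in our own notation, not taken from the abstract): given the recognizer output $o$, recover the string $s$ accepted by the multimodal grammar $G$ that best explains it, where the channel model $P(o \mid s)$ is realized by weighted edit operations whose costs are estimated from data:
\[
\hat{s} \;=\; \operatorname*{arg\,max}_{s \in L(G)} \; P(o \mid s)\, P(s)
\]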