Abstract. In this paper, we present a novel method for multimodal gesture recognition based on neural networks. Our multi-stream recurrent neural network (MRNN) is a completely data-driven model that can be trained from end to end without domain-specific hand engineering. The MRNN extends recurrent neural networks with Long Short-Term Memory cells (LSTM-RNNs) that facilitate the handling of variable-length gestures. We propose a recurrent approach for fusing multiple temporal modalities using multiple streams of LSTM-RNNs. In addition, we propose alternative fusion architectures and empirically evaluate the performance and robustness of these fusion strategies. Experimental results demonstrate that the proposed MRNN outperforms other state-of-theart methods in the Sheffield Kinect Gesture (SKIG) dataset, and has significantly high robustness to noisy inputs.