Current research in emotion recognition focuses on identifying better feature representations and recognition models. The goal of this project is to improve on the performance of current automatic emotion recognition systems by identifying more predictive knowledge-driven features, and by building a hierarchical contextual model that combines state-of-the-art statistical and knowledge-driven features at different layers. To this end, we propose novel disfluency and non-verbal vocalisation (DIS-NV) based features, and show that they are highly predictive for recognising emotions in spontaneous dialogues. We also propose an enhanced Long Short-Term Memory (LSTM) recurrent neural network model that fuses the DIS-NV features with acoustic and lexical features at different layers. Such a model has the potential to improve the quality of emotional interactions in current dialogue systems.
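To make the hierarchical fusion idea concrete, the following is a minimal sketch of one plausible realisation in PyTorch: low-level acoustic and lexical features enter a first LSTM layer, and the knowledge-driven DIS-NV features are concatenated with its hidden states and fed to a second LSTM layer. All dimensions, names, and the exact fusion scheme here are illustrative assumptions, not the project's actual implementation.

```python
import torch
import torch.nn as nn

class HierarchicalFusionLSTM(nn.Module):
    """Illustrative two-layer LSTM fusing feature streams at different depths.

    Assumed layout: statistical (acoustic + lexical) features are modelled by
    the lower LSTM layer; knowledge-driven DIS-NV features are injected at the
    upper layer.  Feature dimensions are placeholders, not those of the paper.
    """

    def __init__(self, acoustic_dim=100, lexical_dim=300, disnv_dim=5,
                 hidden_dim=64, num_emotions=4):
        super().__init__()
        # Layer 1: per-time-step acoustic and lexical features.
        self.lower = nn.LSTM(acoustic_dim + lexical_dim, hidden_dim,
                             batch_first=True)
        # Layer 2: lower-layer hidden states fused with DIS-NV features.
        self.upper = nn.LSTM(hidden_dim + disnv_dim, hidden_dim,
                             batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_emotions)

    def forward(self, acoustic, lexical, disnv):
        # acoustic: (batch, T, acoustic_dim); lexical: (batch, T, lexical_dim)
        # disnv:    (batch, T, disnv_dim)
        low_out, _ = self.lower(torch.cat([acoustic, lexical], dim=-1))
        high_out, _ = self.upper(torch.cat([low_out, disnv], dim=-1))
        # Classify emotion from the hidden state at the final time step.
        return self.classifier(high_out[:, -1])

# Usage on random data: a batch of 8 utterances, 20 time steps each.
model = HierarchicalFusionLSTM()
logits = model(torch.randn(8, 20, 100), torch.randn(8, 20, 300),
               torch.randn(8, 20, 5))
print(logits.shape)  # torch.Size([8, 4])
```

One rationale for this kind of layering is that sparse, utterance-level knowledge-driven cues such as DIS-NV features may be more informative once the low-level signal has already been summarised, though which feature enters which layer is an empirical question rather than a fixed choice.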