In this paper, we address the problem of evaluating spontaneous speech using a combination of machine learning and crowdsourcing. Machine learning techniques alone solve this problem inadequately because automatic speaker-independent speech transcription is inaccurate. The features derived from such transcriptions are therefore also inaccurate, and so is any machine learning model for speech evaluation built on them. To address this, we post the task of speech transcription to a large community of online workers (the crowd). We also obtain spoken English grades from the crowd. We achieve 95% transcription accuracy by combining transcriptions from multiple crowd workers. Speech and prosody features are derived by force-aligning the speech samples on these highly accurate transcriptions. Additionally, we derive surface- and semantic-level features directly from the transcription. To demonstrate the efficacy of our approach, we performed experiments on an expert-graded speech sample of 319 adult non-native speakers. Using t...