Distributed representations of words are attractive because they provide a means of measuring word similarity. However, most approaches learn these representations independently of the task in which they are used. In this paper, we describe a model that learns distributed representations of words jointly with the task, in order to optimize task performance. We evaluate this model on part-of-speech tagging and supertagging and show that it achieves higher accuracy than localist models, especially on rare words. We also show that adding non-linearity to the model further improves accuracy on complex tasks such as supertagging.
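To make the setting concrete, below is a minimal sketch, not the paper's implementation, of a window-based tagger that trains word embeddings directly against the tagging objective, with a tanh hidden layer standing in for the non-linearity; all dimensions, variable names, and the toy data are illustrative assumptions.

```python
# Sketch: embeddings E are updated by the same gradient as the tagger's
# weights, so the representations are learned for the task rather than
# in isolation. Replacing tanh with the identity gives a linear model.
import numpy as np

rng = np.random.default_rng(0)

V, D, H, T, W = 100, 16, 32, 5, 3      # vocab, embed dim, hidden dim, tags, window size
E  = rng.normal(0.0, 0.1, (V, D))      # word embeddings, trained by the tagging loss
W1 = rng.normal(0.0, 0.1, (W * D, H))  # input-to-hidden weights
W2 = rng.normal(0.0, 0.1, (H, T))      # hidden-to-output weights

def forward(window_ids):
    """Tag distribution for the centre word of a window of word IDs."""
    x = E[window_ids].reshape(-1)      # concatenate the window's embeddings
    h = np.tanh(x @ W1)                # the non-linearity
    z = h @ W2
    e = np.exp(z - z.max())
    return x, h, e / e.sum()           # softmax over tags

def train_step(window_ids, gold_tag, lr=0.1):
    """One SGD step on cross-entropy; the embeddings receive task-driven updates."""
    global W1, W2                      # E is mutated in place via np.add.at
    x, h, p = forward(window_ids)
    dz = p.copy()
    dz[gold_tag] -= 1.0                          # d(loss)/d(logits)
    dh = (W2 @ dz) * (1.0 - h ** 2)              # back through tanh
    dx = (W1 @ dh).reshape(len(window_ids), D)   # gradient w.r.t. each embedding
    W2 = W2 - lr * np.outer(h, dz)
    W1 = W1 - lr * np.outer(x, dh)
    np.add.at(E, window_ids, -lr * dx)           # handles repeated word IDs

# Toy usage: repeatedly tag the middle word of a 3-word window.
window = np.array([4, 17, 23])
for _ in range(50):
    train_step(window, gold_tag=2)
print(forward(window)[2])                        # probability mass shifts to tag 2
```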