Information retrieval systems conventionally assess document relevance using the bag of words model. Consequently, relevance scores of documents retrieved for different queries are often difficult to compare, as they are computed on different (or even disjoint) sets of textual features. Many tasks, such as federation of search results or global thresholding of relevance scores, require that scores be globally comparable. To achieve this aim, we propose methods for non-monotonic transformation of relevance scores into probabilities for a contextual advertising selection engine that uses a vector space model. The calibration of the raw scores is based on historical click data.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—retrieval models

General Terms
Algorithms, Experimentation, Measurement, Performance

Keywords
Relevance scores, probability of relevance, logistic regression, online advertising
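
To make the idea of calibrating raw scores against click data concrete, the following is a minimal sketch and not the method proposed in this paper. It fits a logistic regression that maps a raw relevance score to a click probability. Everything in it is an assumption made for illustration: the function names, the quadratic feature map (chosen only because a squared term allows the fitted curve to be non-monotonic in the score), the plain gradient-descent fit, and the small synthetic click log.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def features(score):
    # Hypothetical feature map: bias, score, score squared.
    # The squared term permits a non-monotonic score-to-probability curve.
    return (1.0, score, score * score)


def fit_calibrator(scores, clicks, lr=0.5, epochs=2000):
    # Batch gradient descent on the average logistic log-loss.
    xs = [features(s) for s in scores]
    n = len(xs)
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for x, y in zip(xs, clicks):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j in range(3):
                grad[j] += (p - y) * x[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def click_probability(w, score):
    # Calibrated probability of a click for a raw relevance score.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(score))))


# Synthetic, made-up click log for illustration: (raw score, clicked?).
toy_log = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 1), (0.5, 0),
           (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
toy_scores = [s for s, _ in toy_log]
toy_clicks = [c for _, c in toy_log]

w = fit_calibrator(toy_scores, toy_clicks)
for s in (0.1, 0.5, 0.9):
    print(f"score={s:.1f} -> estimated click probability {click_probability(w, s):.3f}")
```

Because the outputs are probabilities rather than query-specific scores, a single global threshold can in principle be applied across queries, which is the kind of comparability the abstract calls for. A production system would likely use more features and a regularized solver.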