We apply a new active learning formulation to the problem of learning medical concepts from unstructured text. The formulation is based on maximizing the mutual information that a sample labeling provides about the retrieval/classification model. This methodology is related to, and extends, the Query-by-Committee (QBC) approach [12] by exploiting unlabeled data beyond its common use as a pool of potential query points. Unlike QBC, this method uses unlabeled data in addition to labeled data to select more appropriate samples for labeling. The samples thus chosen are both informative and relevant with respect to a distribution of interest. This flexibility also allows us to tailor the model to arbitrary distributions relevant to the task at hand, in particular the distribution of the test data. The formulation has implications for scenarios where the training and test distributions differ, or when a general model is adapted to a more s...
Rómer Rosales, Praveen Krishnamurthy, R. Bh