The popularity of Wikipedia and other online knowledge bases has recently produced interest in the machine learning community in the problem of automatic linking. Automatic hyperlinking can be viewed as two subproblems: link detection, which determines the source of a link, and link disambiguation, which determines the destination of a link. Wikipedia is a rich corpus with hyperlink data provided by its authors. This data can be used to train classifiers that mimic the authors' linking decisions in some capacity. In this paper, we introduce automatic link detection as a sequence labeling problem. Conditional random fields (CRFs) are a probabilistic framework for labeling sequential data. We show that a CRF trained with different types of features from the Wikipedia dataset can automatically detect links with almost perfect precision and high recall.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing – text analysis; I.3.1...
James J. Gardner, Li Xiong
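
To make the sequence labeling formulation concrete, the following minimal sketch converts a sentence in wiki markup into a token sequence with one label per token, which is the kind of input a CRF is trained on. The B-LINK/I-LINK/O label scheme, the whitespace tokenization, and the function names `wikitext_to_bio` and `token_features` are assumptions made for illustration; they are not taken from the paper, and the features shown are hypothetical examples rather than the feature set used in the experiments.

```python
import re

# Matches [[target]] or [[target|anchor text]] in wiki markup.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def wikitext_to_bio(text):
    """Return (tokens, labels) using an assumed B-LINK / I-LINK / O scheme."""
    tokens, labels = [], []
    pos = 0
    for match in LINK_RE.finditer(text):
        # Text before the link lies outside any link.
        for tok in text[pos:match.start()].split():
            tokens.append(tok)
            labels.append("O")
        # The visible anchor text is the link source to be detected.
        anchor = match.group(2) or match.group(1)
        for i, tok in enumerate(anchor.split()):
            tokens.append(tok)
            labels.append("B-LINK" if i == 0 else "I-LINK")
        pos = match.end()
    # Remaining text after the last link.
    for tok in text[pos:].split():
        tokens.append(tok)
        labels.append("O")
    return tokens, labels


def token_features(tokens, i):
    """Hypothetical per-token features of the kind a CRF could consume."""
    return {
        "lower": tokens[i].lower(),
        "is_capitalized": tokens[i][:1].isupper(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


if __name__ == "__main__":
    sentence = ("[[Conditional random field|Conditional random fields]] "
                "label [[sequence]] data .")
    toks, labs = wikitext_to_bio(sentence)
    for tok, lab in zip(toks, labs):
        print(tok, lab)
```

Under this encoding, link detection reduces to predicting the label sequence for unseen text, and the author-provided hyperlinks in Wikipedia supply the training labels.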