Linguists and geographers are increasingly interested in route direction documents because they contain interesting motion descriptions and language patterns. Large numbers of such documents are readily available on the Internet. Automatically extracting the meaningful route parts (destinations, origins, and instructions) from route direction documents is a challenging task, yet no existing work addresses it. In this paper, we present our effort toward this goal. Based on our observation that sentences are the basic units of route parts, we extract sentences from HTML documents using both natural language knowledge and HTML tag information. We then study the sentence classification problem in route direction documents and its sequential nature. We compare and analyze several machine learning methods and study the impact of different feature sets. Based on the insights obtained, we propose to use sequence labelling models such as conditional random fields (CRFs) and maximum entropy Markov models (MEMMs) and the...
Xiao Zhang, Prasenjit Mitra, Sen Xu, Anuj R. Jaisw