A method is presented for segmenting documents into conceptually related areas. The equivalence of text is often determined from the number of word repetitions. This approach is unsuitable for detecting short segments because terms tend not to be repeated across just a few sentences. In this paper we investigate the contribution of two other lexical features for finding related words: collocation and relation weights (which identify semantic relations). An experiment was conducted on a set of test data with known topic changes, and the performance of each of the three features (word repetition, collocation, and relation weights) was compared independently. A combination of all features was the most reliable indicator of a topic change. In another experiment, CNN news summaries were segmented into their individual news stories. Precision and recall rates of around 90% are reported for news story boundary detection.
Amanda C. Jobbins, Lindsay J. Evett
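To make the evaluation measure in the abstract concrete, the following is a minimal sketch of how precision and recall might be computed for boundary detection. It assumes that a predicted boundary counts as correct only when it coincides exactly with a reference boundary. The abstract does not state the matching criterion, so this is an assumption. The function name `boundary_precision_recall` and the boundary positions are hypothetical and chosen purely for illustration; they are not data from the paper.

```python
def boundary_precision_recall(predicted, reference):
    """Return (precision, recall) for predicted segment boundaries.

    Assumption: a predicted boundary is correct only if it matches a
    reference boundary exactly (no tolerance window).
    """
    predicted_set = set(predicted)
    reference_set = set(reference)
    correct = len(predicted_set & reference_set)
    # Precision: fraction of predicted boundaries that are correct.
    precision = correct / len(predicted_set) if predicted_set else 0.0
    # Recall: fraction of reference boundaries that were found.
    recall = correct / len(reference_set) if reference_set else 0.0
    return precision, recall


# Hypothetical sentence indices at which a new story begins.
reference = [3, 7, 12, 18, 25]
predicted = [3, 7, 13, 18, 25]

precision, recall = boundary_precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up example, four of the five predicted boundaries match a reference boundary, so both measures come out at 0.80. A more lenient scheme that accepts near misses would give different figures.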