Unsupervised sequence learning is important to many applications. A learner is presented with unlabeled sequential data, and must discover sequential patterns that characterize the data. Popular approaches to such learning include (and often combine) frequency-based approaches and statistical analysis. However, the quality of results is often far from satisfactory. Though most previous investigations seek to address method-specific limitations, we instead focus on general (method-neutral) limitations in current approaches. This paper takes two key steps towards addressing such general quality-reducing flaws. First, we carry out an in-depth empirical comparison and analysis of popular sequence learning methods in terms of the quality of information produced, for several synthetic and real-world datasets, under controlled settings of noise. We find that both frequency-based and statistics-based approaches (i) suffer from common statistical biases based on the length of the sequences co...
Yoav Horman, Gal A. Kaminka