Large 0-1 datasets arise in various applications, such as market basket analysis and information retrieval. We concentrate on the study of topic models, aiming at results which indicate why certain methods succeed or fail. We describe simple algorithms for finding topic models from 0-1 data. We give theoretical results showing that the algorithms can discover the epsilon-separable topic models of Papadimitriou et al. We present empirical results showing that the algorithms find natural topics in real-world data sets. We also briefly discuss the connections to matrix approaches, including nonnegative matrix factorization and independent component analysis.

Categories and Subject Descriptors
G.3 [Probability and Statistics]: Contingency table analysis; H.2.8 [Database Management]: Database Applications--Data mining; I.5.1 [Pattern Recognition]: Models--Structural

General Terms
Algorithms, Theory

Ella Bingham, Heikki Mannila, Jouni K. Seppän