A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal...
Discovering episodes, frequent sets of events from a sequence has been an active field in pattern mining. Traditionally, a level-wise approach is used to discover all frequent epis...
Many applications in text processing require significant human effort for either labeling large document collections (when learning statistical models) or extrapolating rules from...
We introduce a new EM framework in which it is possible not only to optimize the model parameters but also the number of model components. A key feature of our approach is that we...
Kernel k-means and spectral clustering have both been used to identify clusters that are non-linearly separable in input space. Despite significant research, these methods have re...