Effective information retrieval in XML documents requires the user to have good knowledge of document structure and of some formal query language. XML query languages like XPath a...
While research on scientific chart recognition is being carried out, there is no suitable standard that can be used to evaluate the overall performance of the chart recognition res...
We present two machine learning approaches to information extraction from semi-structured documents that can be used if no annotated training data are available, but there does ex...
This paper presents a series of tools for the extraction of specialized corpora from the web and its subsequent analysis mainly with statistical techniques. It is an integrated sy...
We introduce the Spherical Admixture Model (SAM), a Bayesian topic model for arbitrary 2 normalized data. SAM maintains the same hierarchical structure as Latent Dirichlet Allocat...
Joseph Reisinger, Austin Waters, Bryan Silverthorn...