Sciweavers

ERCIMDL
2010
Springer

SciPlore Xtract: Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size)

Extracting titles from a PDF's full text is an important task in information retrieval to identify PDFs. Existing approaches apply complicated and expensive (in terms of computing power) machine learning algorithms such as Support Vector Machines and Conditional Random Fields. In this paper we present a simple rule-based heuristic, which considers style information (font size) to identify a PDF's title. In a first experiment we show that this heuristic delivers better results (77.9% accuracy) than a support vector machine by CiteSeer (69.4% accuracy) in an "academic search engine" scenario and better run times (8:19 minutes vs. 57:26 minutes).
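The font-size idea in the abstract can be illustrated with a minimal sketch. This is not the SciPlore Xtract implementation: the `TextLine` record, the `guess_title` function, the `tolerance` parameter, and the rule "take the largest font on the first page and join adjacent lines of that size" are all assumptions made for illustration. The sketch also assumes the PDF has already been converted into text lines annotated with a font size by some extraction tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextLine:
    """Hypothetical record: one extracted line of text plus its font size."""
    text: str
    font_size: float
    page: int = 1


def guess_title(lines: List[TextLine], tolerance: float = 0.1) -> Optional[str]:
    """Guess the title as the first run of largest-font lines on page 1.

    Illustrative rule only; the paper's actual heuristic may differ.
    """
    first_page = [ln for ln in lines if ln.page == 1 and ln.text.strip()]
    if not first_page:
        return None

    largest = max(ln.font_size for ln in first_page)

    # Collect the first run of consecutive lines set in the largest font,
    # so that a title wrapped over several lines is joined back together.
    parts = []
    for ln in first_page:
        if abs(ln.font_size - largest) <= tolerance:
            parts.append(ln.text.strip())
        elif parts:
            break
    return " ".join(parts)


# Made-up font sizes for demonstration purposes.
example = [
    TextLine("ERCIMDL 2010", 9.0),
    TextLine("SciPlore Xtract: Extracting Titles from", 17.0),
    TextLine("Scientific PDF Documents", 17.0),
    TextLine("Jöran Beel, Bela Gipp, Ammar Shaker, Nick Friedrich", 11.0),
    TextLine("Abstract", 10.0),
]
print(guess_title(example))
```

A rule like this needs no training data and runs in a single pass over the first page, which is consistent with the run-time advantage the abstract reports over machine learning approaches.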
Added 02 Mar 2011
Updated 02 Mar 2011
Type Journal
Year 2010
Where ERCIMDL
Authors Jöran Beel, Bela Gipp, Ammar Shaker, Nick Friedrich