WWW
2006
ACM

Using graph matching techniques to wrap data from PDF documents

Wrapping is the process of navigating a data source, semiautomatically extracting data and transforming it into a form suitable for data processing applications. There are currently a number of established products on the market for wrapping data from web pages. One such approach is Lixto [1], a product of research performed at our institute. Our work is concerned with extending the wrapping functionality of Lixto to PDF documents. As the PDF format is relatively unstructured, this is a challenging task. We have developed a method to segment the page into blocks, which are represented as nodes in a relational graph. This paper describes our current research in the use of relational matching techniques on this graph to locate wrapping instances.

Categories and Subject Descriptors: I.7.5 [Document and Text Processing]: Document Capture--document analysis; H.3.3 [Information Systems]: Information Search and Retrieval

General Terms: Algorithms, Experimentation
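
The idea of representing page blocks as nodes in a relational graph and then matching a pattern against it can be illustrated with a minimal Python sketch. This is not the authors' method: the `Block` type, the "right"/"below" relations, the nearest-neighbour rule and the backtracking matcher are all assumptions chosen for illustration, and the coordinates are made up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical text block: bounding box with y growing downwards."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def _overlap(a0: float, a1: float, b0: float, b1: float) -> bool:
    # True if the intervals [a0, a1] and [b0, b1] share a positive length.
    return min(a1, b1) - max(a0, b0) > 0


def build_graph(blocks: List[Block]) -> Dict[Tuple[int, str], int]:
    """Link each block to its nearest neighbour to the right and below."""
    edges: Dict[Tuple[int, str], int] = {}
    for i, a in enumerate(blocks):
        right = [(b.x0 - a.x1, j) for j, b in enumerate(blocks)
                 if j != i and b.x0 >= a.x1
                 and _overlap(a.y0, a.y1, b.y0, b.y1)]
        below = [(b.y0 - a.y1, j) for j, b in enumerate(blocks)
                 if j != i and b.y0 >= a.y1
                 and _overlap(a.x0, a.x1, b.x0, b.x1)]
        if right:
            edges[(i, "right")] = min(right)[1]
        if below:
            edges[(i, "below")] = min(below)[1]
    return edges


def find_matches(
    blocks: List[Block],
    edges: Dict[Tuple[int, str], int],
    predicates: List[Callable[[Block], bool]],
    pattern_edges: List[Tuple[int, str, int]],
) -> List[Tuple[int, ...]]:
    """Backtracking search for assignments of pattern nodes to blocks."""
    matches: List[Tuple[int, ...]] = []

    def extend(assignment: List[int]) -> None:
        k = len(assignment)
        if k == len(predicates):
            matches.append(tuple(assignment))
            return
        for j, block in enumerate(blocks):
            if j in assignment or not predicates[k](block):
                continue
            candidate = assignment + [j]
            # Check only pattern edges whose endpoints are both assigned.
            if all(edges.get((candidate[s], rel)) == candidate[d]
                   for s, rel, d in pattern_edges if s <= k and d <= k):
                extend(candidate)

    extend([])
    return matches


# Made-up page: two label/value pairs stacked vertically.
blocks = [
    Block("Price:", 10, 10, 50, 20),
    Block("42.00", 60, 10, 100, 20),
    Block("Title:", 10, 30, 50, 40),
    Block("Graph wrap", 60, 30, 140, 40),
]
graph = build_graph(blocks)

# Pattern: a block ending in ":" with any block directly to its right.
predicates = [lambda b: b.text.endswith(":"), lambda b: True]
pattern_edges = [(0, "right", 1)]
matches = find_matches(blocks, graph, predicates, pattern_edges)
```

Here each match is a tuple of block indices, one per pattern node, so a label/value pair found on the page is reported as the pair of blocks it occupies. A real system would need richer relations and tolerance for layout variation, which this sketch does not attempt.
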
Tamir Hassan, Robert Baumgartner
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2006
Where WWW