

Using graph matching techniques to wrap data from PDF documents

Wrapping is the process of navigating a data source, semiautomatically extracting data and transforming it into a form suitable for data processing applications. There are currently a number of established products on the market for wrapping data from web pages. One such approach is Lixto [1], a product of research performed at our institute. Our work is concerned with extending the wrapping functionality of Lixto to PDF documents. As the PDF format is relatively unstructured, this is a challenging task. We have developed a method to segment the page into blocks, which are represented as nodes in a relational graph. This paper describes our current research in the use of relational matching techniques on this graph to locate wrapping instances.

Categories and Subject Descriptors: I.7.5 [Document and Text Processing]: Document Capture--document analysis; H.3.3 [Information Systems]: Information Search and Retrieval

General Terms: Algorithms, Experimentation
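
To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of how page blocks might be held as nodes in a relational graph and how a small pattern could be matched against it. Everything here is an assumption made for illustration: the `Block` fields, the "right"/"below" nearest-neighbour relations, the coordinate convention (y increasing downwards) and the brute-force backtracking matcher are hypothetical and are not taken from the paper or from Lixto.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A text block on a page (hypothetical structure; y grows downwards)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a0, a1, b0, b1):
    """True if the 1-D intervals [a0, a1] and [b0, b1] overlap."""
    return min(a1, b1) > max(a0, b0)


def build_graph(blocks):
    """Map (block index, relation) to the index of the nearest neighbour
    to the right of, or below, that block."""
    edges = {}
    for i, a in enumerate(blocks):
        right = [(b.x0 - a.x1, j) for j, b in enumerate(blocks)
                 if j != i and b.x0 >= a.x1 and overlaps(a.y0, a.y1, b.y0, b.y1)]
        below = [(b.y0 - a.y1, j) for j, b in enumerate(blocks)
                 if j != i and b.y0 >= a.y1 and overlaps(a.x0, a.x1, b.x0, b.x1)]
        if right:
            edges[(i, "right")] = min(right)[1]
        if below:
            edges[(i, "below")] = min(below)[1]
    return edges


def match(pattern_nodes, pattern_edges, blocks, edges):
    """Yield every assignment of pattern nodes to distinct blocks that
    satisfies each node predicate and each required relation.
    pattern_nodes: {name: predicate(text)}
    pattern_edges: [(source name, relation, target name)]"""
    names = list(pattern_nodes)

    def extend(assign):
        if len(assign) == len(names):
            if all(edges.get((assign[s], r)) == assign[d]
                   for s, r, d in pattern_edges):
                yield dict(assign)
            return
        name = names[len(assign)]
        for i, b in enumerate(blocks):
            if i not in assign.values() and pattern_nodes[name](b.text):
                assign[name] = i
                yield from extend(assign)
                del assign[name]

    yield from extend({})


# A toy page: a two-column table with a "Total" row.
blocks = [
    Block("Item", 0, 0, 40, 10), Block("Price", 100, 0, 140, 10),
    Block("Widget", 0, 20, 40, 30), Block("9.99", 100, 20, 140, 30),
    Block("Total", 0, 40, 40, 50), Block("9.99", 100, 40, 140, 50),
]
edges = build_graph(blocks)

# Pattern: a block reading "Total" whose right-hand neighbour is a number.
pattern_nodes = {
    "label": lambda t: t == "Total",
    "value": lambda t: t.replace(".", "", 1).isdigit(),
}
pattern_edges = [("label", "right", "value")]
matches = list(match(pattern_nodes, pattern_edges, blocks, edges))
```

In this sketch the pattern plays the role of a wrapper definition and each match is one wrapping instance. The structural constraint is what separates the two identical "9.99" blocks: only the one to the right of "Total" satisfies the pattern. A real system would likely need richer relations and error-tolerant matching.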
Tamir Hassan, Robert Baumgartner
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2006
Where WWW