Towards a Canonical and Structured Representation of PDF Documents through Reverse Engineering


Download www.bloechle.ch

This article presents Xed, a reverse engineering tool for PDF documents that extracts the original document layout structure. Xed combines electronic extraction methods with state-of-the-art document analysis techniques and outputs the layout structure in a hierarchical canonical form, i.e., one that is universal and independent of the document type. The article first reviews the major traps and tricks of the PDF format. It then introduces the architecture of Xed along with its main modules, in particular the algorithm that extracts the document's physical structure. A canonical format is then proposed and discussed with an example. Finally, the results of a practical evaluation are presented, followed by an outline of future work on logical structure extraction.
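To make the idea of a hierarchical layout structure more concrete, the following is a minimal sketch of how positioned text fragments could be grouped bottom-up into lines and blocks. It is not Xed's algorithm, which the abstract does not describe in detail. The names `Fragment`, `Line`, `Block`, `group_into_lines` and `group_into_blocks`, the tolerance values, and the assumption that `y` is measured downward from the top of the page are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Fragment:
    """A positioned text run (hypothetical input, e.g. from a PDF content stream)."""
    text: str
    x: float  # left edge of the run
    y: float  # baseline position, assumed to grow downward from the page top


@dataclass
class Line:
    y: float  # baseline of the first fragment placed on this line
    fragments: List[Fragment] = field(default_factory=list)

    @property
    def text(self) -> str:
        # Reading order inside a line: left to right.
        ordered = sorted(self.fragments, key=lambda f: f.x)
        return " ".join(f.text for f in ordered)


@dataclass
class Block:
    lines: List[Line] = field(default_factory=list)


def group_into_lines(fragments: List[Fragment], y_tol: float = 2.0) -> List[Line]:
    """Merge fragments whose baselines lie within y_tol of the current line."""
    lines: List[Line] = []
    for frag in sorted(fragments, key=lambda f: f.y):
        if lines and abs(frag.y - lines[-1].y) <= y_tol:
            lines[-1].fragments.append(frag)
        else:
            lines.append(Line(y=frag.y, fragments=[frag]))
    return lines


def group_into_blocks(lines: List[Line], gap: float = 20.0) -> List[Block]:
    """Start a new block when the vertical distance to the previous line exceeds gap.

    Expects lines sorted top to bottom, as returned by group_into_lines.
    """
    blocks: List[Block] = []
    for line in lines:
        if blocks and line.y - blocks[-1].lines[-1].y <= gap:
            blocks[-1].lines.append(line)
        else:
            blocks.append(Block(lines=[line]))
    return blocks


if __name__ == "__main__":
    sample = [
        Fragment("World", 60, 100.5),
        Fragment("Hello", 10, 100),
        Fragment("Next", 10, 114),
        Fragment("Far", 10, 200),
    ]
    for i, block in enumerate(group_into_blocks(group_into_lines(sample))):
        print(f"block {i}:", [line.text for line in block.lines])
```

A real extractor would also have to handle multi-column pages, font changes and graphics. The sketch only shows how a flat list of fragments can become a tree of blocks, lines and fragments.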

Maurizio Rigamonti, Jean-Luc Bloechle, Karim Hadjar, Denis Lalanne, Rolf Ingold


Document Analysis | Document Layout Structure | ICDAR 2005 | Layout Structure | Structure Extraction


» XCDF A Canonical and Structured Document Format

» VIFOR 2 a tool for browsing and documentation

» CE2 towards a large scale hybrid search engine with integrated ranking support

» Injecting information into atomic units of text

Post Info

Added	24 Jun 2010
Updated	24 Jun 2010
Type	Conference
Year	2005
Where	ICDAR
Authors	Maurizio Rigamonti, Jean-Luc Bloechle, Karim Hadjar, Denis Lalanne, Rolf Ingold


