Xed: A New Tool for eXtracting Hidden Structures from Electronic Documents

PDF has become a very common format for exchanging printable documents. It can also be generated easily from the major document formats, which makes a huge number of PDF documents available over the net. However, its use is limited to displaying and printing, which considerably reduces search and retrieval capabilities. For this reason, additional tools have recently appeared that extract the textual content. Their practical use is limited, however, because the text's reading order is not necessarily preserved, especially in multi-column documents or in the presence of complex layouts. Our thesis is that those tools do not consider the hidden layout and logical structures of documents, which could greatly improve their results. We propose a novel approach to overcome the limitations of document content extraction by merging a) low-level extraction methods applied to PDF files with b) layout analysis performed on a synthetically generated TIFF image. The paper describes...
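
The reading-order problem mentioned in the abstract can be illustrated with a small sketch. This is not the Xed algorithm, which the abstract does not detail. It only assumes that low-level extraction yields text fragments with positions and that a layout-analysis step reports column regions. The names `Fragment`, `Column`, `naive_order`, and `column_aware_order` are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A text fragment with the top-left corner of its box (y grows downward)."""
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class Column:
    """A column region, as a layout-analysis step might report it (assumption)."""
    x_min: float
    x_max: float


def naive_order(fragments):
    # Sort top-to-bottom, then left-to-right. On a multi-column page this
    # interleaves lines that belong to different columns.
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_aware_order(fragments, columns):
    # Visit columns left to right; inside each column, read top to bottom.
    ordered = []
    for col in sorted(columns, key=lambda c: c.x_min):
        in_col = [f for f in fragments if col.x_min <= f.x < col.x_max]
        ordered.extend(f.text for f in sorted(in_col, key=lambda f: f.y))
    return ordered


# Two columns (A on the left, B on the right), two lines each.
fragments = [
    Fragment("A1", 50, 100),
    Fragment("B1", 320, 100),
    Fragment("A2", 50, 120),
    Fragment("B2", 320, 120),
]
columns = [Column(0, 300), Column(300, 600)]

print(naive_order(fragments))
print(column_aware_order(fragments, columns))
```

In this toy page the naive sort alternates between the two columns, while the column-aware variant reads the left column in full before the right one. Real documents would also need handling for headers, figures, and fragments that span several columns.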

Karim Hadjar, Maurizio Rigamonti, Denis Lalanne, Rolf Ingold

DIAL 2004 | Image Analysis | Major Documents Formats | PDF Documents | Printable Documents

» xTagger a new approach to authoring document-centric XML

» Ontology-based design information extraction and retrieval

» Extracting reusable document components for variable data printing

Post Info

Added	20 Aug 2010
Updated	20 Aug 2010
Type	Conference
Year	2004
Where	DIAL
Authors	Karim Hadjar, Maurizio Rigamonti, Denis Lalanne, Rolf Ingold
