User-Guided Wrapping of PDF Documents Using Graph Matching Techniques



There are a number of established products on the market for wrapping (semi-automatic navigation and extraction of data) from web pages. These solutions make use of the inherent structure of HTML to locate instances of data to be wrapped. As PDF documents do not have such a structure, wrapping PDF documents has long been recognized as a challenging problem. We have developed a novel system for wrapping PDF documents, which is currently at a prototype stage. A PDF document is represented as an attributed relational graph, in which nodes represent physical items on the page and edges represent spatial and logical relationships. A wrapper is defined as a subgraph of the document with additional conditions, and can quickly and intuitively be created by a non-expert using the GUI. An algorithm based on subgraph isomorphism is then used to find the data instances and extract the required data. Experiments show that our approach achieves good results with good execution time.
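
The abstract does not give the matching algorithm itself, so the following is only a minimal sketch of the general idea, assuming a plain backtracking search rather than the paper's actual method. The names `Graph`, `Wrapper` and `find_instances`, the relation labels, and the sample page are all hypothetical. Nodes carry attributes, directed edges carry a relation label, and a wrapper is a small pattern graph whose nodes have extra conditions (predicates).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, Tuple


@dataclass
class Graph:
    # node id -> attributes of a physical item on the page (hypothetical schema)
    nodes: Dict[str, dict] = field(default_factory=dict)
    # (source id, target id) -> relation of the target to the source
    edges: Dict[Tuple[str, str], str] = field(default_factory=dict)


@dataclass
class Wrapper:
    # pattern node id -> condition that a matching document node must satisfy
    conditions: Dict[str, Callable[[dict], bool]]
    # (pattern source, pattern target) -> relation label that must be present
    edges: Dict[Tuple[str, str], str]


def find_instances(doc: Graph, wrapper: Wrapper) -> Iterator[Dict[str, str]]:
    """Yield every injective mapping of pattern nodes onto document nodes."""
    order = list(wrapper.conditions)

    def consistent(mapping: Dict[str, str]) -> bool:
        # Every pattern edge whose endpoints are both mapped must exist
        # in the document with the same relation label.
        for (src, dst), label in wrapper.edges.items():
            if src in mapping and dst in mapping:
                if doc.edges.get((mapping[src], mapping[dst])) != label:
                    return False
        return True

    def extend(mapping: Dict[str, str]) -> Iterator[Dict[str, str]]:
        if len(mapping) == len(order):
            yield dict(mapping)
            return
        pattern_node = order[len(mapping)]
        for doc_node, attrs in doc.nodes.items():
            if doc_node in mapping.values():
                continue  # keep the mapping injective
            if not wrapper.conditions[pattern_node](attrs):
                continue  # node condition of the wrapper fails
            mapping[pattern_node] = doc_node
            if consistent(mapping):
                yield from extend(mapping)
            del mapping[pattern_node]  # backtrack

    yield from extend({})


# Hypothetical page: two "Price:" labels, each with a number to its right.
doc = Graph(
    nodes={
        "n1": {"text": "Price:"},
        "n2": {"text": "19.99"},
        "n3": {"text": "Price:"},
        "n4": {"text": "5.00"},
        "n5": {"text": "Total"},
    },
    edges={
        ("n1", "n2"): "right_of",
        ("n3", "n4"): "right_of",
        ("n1", "n3"): "below",
    },
)

wrapper = Wrapper(
    conditions={
        "label": lambda a: a["text"] == "Price:",
        "value": lambda a: a["text"].replace(".", "", 1).isdigit(),
    },
    edges={("label", "value"): "right_of"},
)

matches = list(find_instances(doc, wrapper))
values = [doc.nodes[m["value"]]["text"] for m in matches]
print(values)
```

Each match maps the wrapper's `label` and `value` nodes to one data instance on the page, and the extracted data is read from the node mapped to `value`. This sketch only checks that the wrapper's edges are present in the document (a non-induced match); a real system would likely add pruning and richer conditions.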

Tamir Hassan


Attributed Relational Graph | Data from Web Pages | Document Analysis | ICDAR 2009 | PDF Document


Post Info

Added	21 May 2010
Updated	21 May 2010
Type	Conference
Year	2009
Where	ICDAR
Authors	Tamir Hassan
