PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents


This paper presents PDF-TREX, a heuristic approach for table recognition and extraction from PDF documents. The heuristic starts from an initial set of basic content elements and aligns and groups them bottom-up, considering only their spatial features, in order to identify tabular arrangements of information. The aim of the approach is to recognize tables contained in PDF documents as 2-dimensional grids on a Cartesian plane and to extract them as sets of cells equipped with 2-dimensional coordinates. Experiments carried out on a dataset of tables drawn from documents in different domains show that the approach performs well in recognizing table cells. The approach aims to improve PDF document annotation and information extraction by providing an output that can be further processed to understand table and document contents.
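The bottom-up idea in the abstract can be illustrated with a minimal sketch. This is not the PDF-TREX heuristic itself, whose details are not given here; it is a simplified stand-in that merges overlapping bounding-box projections into row and column bands and then gives each cell a (row, column) coordinate. The names `Element`, `merge_bands`, `band_index`, and `to_grid` are hypothetical, and the sketch assumes the content elements have already been extracted with bounding boxes in a top-left-origin coordinate system (native PDF coordinates start at the bottom-left, so a real pipeline would need to flip the y axis).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A basic content element with its bounding box (assumed input format)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def merge_bands(intervals, gap=0.0):
    """Merge 1-D intervals that overlap (or lie within `gap`) into sorted bands."""
    bands = []
    for lo, hi in sorted(intervals):
        if bands and lo <= bands[-1][1] + gap:
            # Interval touches the current band: extend the band.
            bands[-1] = (bands[-1][0], max(bands[-1][1], hi))
        else:
            bands.append((lo, hi))
    return bands


def band_index(bands, lo, hi):
    """Return the index of the band that contains the midpoint of [lo, hi]."""
    mid = (lo + hi) / 2
    for i, (b_lo, b_hi) in enumerate(bands):
        if b_lo <= mid <= b_hi:
            return i
    raise ValueError("interval does not fall into any band")


def to_grid(elements):
    """Group elements into cells keyed by (row, column), using geometry only."""
    rows = merge_bands((e.y0, e.y1) for e in elements)
    cols = merge_bands((e.x0, e.x1) for e in elements)
    cells = {}
    # Visit elements top-to-bottom, left-to-right so cell text stays in reading order.
    for e in sorted(elements, key=lambda e: (e.y0, e.x0)):
        key = (band_index(rows, e.y0, e.y1), band_index(cols, e.x0, e.x1))
        cells.setdefault(key, []).append(e.text)
    return {key: " ".join(parts) for key, parts in cells.items()}


if __name__ == "__main__":
    sample = [
        Element("Name", 10, 10, 40, 20),
        Element("Qty", 60, 10, 80, 20),
        Element("Apple", 10, 30, 45, 40),
        Element("3", 65, 30, 70, 40),
    ]
    print(to_grid(sample))
```

A projection-based grouping like this breaks down on tables with spanning cells or on pages where surrounding text overlaps the table columns, which is one reason a real system needs richer alignment heuristics than this sketch.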

Ermelinda Oro, Massimo Ruffolo


Basic Content Elements | Document Analysis | Heuristic Approach | ICDAR 2009 | PDF Document


» Exploiting web search to generate synonyms for entities

» CaseBased Reasoning for Invoice Analysis and Recognition

Post Info

Added	21 May 2010
Updated	21 May 2010
Type	Conference
Year	2009
Where	ICDAR
Authors	Ermelinda Oro, Massimo Ruffolo
