Most prior work on information extraction has focused on extracting information from text in digital documents. However, often, the most important information being reported in an article is presented in tabular form in a digital document. If the data reported in tables can be extracted and stored in a database, the data can be queried and joined with other data using database management systems. In order to prepare the data source for table search, accurately detecting the table boundary plays a crucial role for the later table structure decomposition. Table boundary detection and content extraction is a challenging problem because tabular formats are not standardized across all documents. In this paper, we propose a simple but effective preprocessing method to improve the table boundary detection performance by considering the sparse-line property of table rows. Our method easily simplifies the table boundary detection problem into the sparse line analysis problem with much less noi...
Ying Liu, Prasenjit Mitra, C. Lee Giles