Towards domain-independent information extraction from web tables

Traditionally, information extraction from web tables has focused on small, more or less homogeneous corpora, often relying on assumptions about the use of <table> tags. The multitude of different HTML implementations of web tables makes these approaches difficult to scale. In this paper, we approach the problem of domain-independent information extraction from web tables by shifting our attention from the tree-based representation of web pages to a variation of the two-dimensional visual box model used by web browsers to display the information on the screen. The topological and style information obtained in this way allows us to fill the gap created by missing domain-specific knowledge about content and table templates. We believe that, in a future step, this approach can become the basis for a new way of large-scale knowledge acquisition from the current "Visual Web."

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications - Data Mining; H.3.3 ...
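To make the shift from the tag tree to rendered boxes more concrete, here is a minimal Python sketch. It is not the authors' algorithm: it only assumes that each text fragment comes with the bounding box a browser would draw for it, and it recovers a grid by clustering aligned left and top edges. The `Box` type, the `boxes_to_grid` helper, the pixel tolerance, and the sample values are all hypothetical and invented for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rendered text fragment with its on-screen bounding box (hypothetical type)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def cluster_edges(values, tol):
    """Merge coordinates that lie within `tol` of their sorted neighbour.

    Returns the smallest coordinate of each cluster, in ascending order.
    """
    edges = []
    previous = None
    for value in sorted(values):
        if previous is None or value - previous > tol:
            edges.append(value)  # start of a new cluster
        previous = value
    return edges


def boxes_to_grid(boxes, tol=3.0):
    """Arrange boxes into rows and columns using only their positions.

    No <table> markup is consulted; alignment of left and top edges
    is the only evidence used (an assumption of this sketch).
    """
    col_edges = cluster_edges([b.left for b in boxes], tol)
    row_edges = cluster_edges([b.top for b in boxes], tol)
    grid = [["" for _ in col_edges] for _ in row_edges]
    for b in boxes:
        row = bisect_right(row_edges, b.top) - 1
        col = bisect_right(col_edges, b.left) - 1
        grid[row][col] = b.text
    return grid


# Invented sample: four fragments laid out as a 2x2 table, e.g. with <div>s.
example = [
    Box("Name", 10, 10, 60, 30),
    Box("Price", 120, 11, 200, 30),
    Box("Widget", 11, 40, 60, 60),
    Box("9.99", 121, 40, 200, 60),
]

if __name__ == "__main__":
    for row in boxes_to_grid(example):
        print(row)
```

Because the sketch looks only at coordinates, the same grid comes out whether the page was built with <table> tags, nested <div>s, or absolutely positioned elements. Merged cells, style cues, and noisy layouts are deliberately left out.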

Bernhard Krüpl, Bernhard Pollak, Marcus Herzog, Paul Bohunsky, Wolfgang Gatterbauer

Internet Technology | Large-scale Knowledge Acquisition | Two-dimensional Visual Box | Web Tables | WWW 2007 |

Post Info

Added	22 Nov 2009
Updated	22 Nov 2009
Type	Conference
Year	2007
Where	WWW
Authors	Bernhard Krüpl, Bernhard Pollak, Marcus Herzog, Paul Bohunsky, Wolfgang Gatterbauer
