Information Extraction from Tree Documents by Learning Subtree Delimiters

Information extraction from HTML pages has conventionally treated the pages as plain text documents extended with HTML tags. However, the growing maturity and correct usage of HTML/XHTML formats open an opportunity to treat Web pages as trees, to mine the rich structural context in the trees, and to learn accurate extraction rules. In this paper, we generalize the notion of delimiter developed for string information extraction to tree documents. Similar to delimiters in strings, we define delimiters in tree documents as subtrees surrounding the text leaves. We formalize wrapper induction for tree documents as learning classification rules based on the subtree delimiters. We analyze a restricted case of subtree delimiters in the form of simple paths. We design an efficient data structure for storing candidate delimiters and an incremental algorithm for finding the most discriminative subtree delimiters for the wrapper.
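
The restricted case of simple-path delimiters can be illustrated with a small sketch. This is not the paper's data structure or incremental algorithm; it is a simplified illustration that assumes a delimiter is the suffix of the root-to-leaf tag path, and that a "discriminative" delimiter is the shortest such suffix matching all labeled leaves and no others. The sample page and the function names `text_leaves`, `shortest_discriminative_suffix`, and `extract` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed XHTML fragment used only for illustration.
PAGE = """<html><body>
<table>
<tr><td><b>Title A</b></td><td><i>10.00</i></td></tr>
<tr><td><b>Title B</b></td><td><i>12.50</i></td></tr>
</table>
<p><b>Footer note</b></p>
</body></html>"""


def text_leaves(root):
    """Yield (path, text) in document order for each element with non-empty text.

    `path` is the tuple of tag names from the root down to the element.
    Tail text (text following a closing tag) is ignored in this sketch.
    """
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        text = (node.text or "").strip()
        if text:
            yield path, text
        # Push children in reverse so they are popped in document order.
        for child in reversed(list(node)):
            stack.append((child, path + (child.tag,)))


def shortest_discriminative_suffix(leaves, positives):
    """Return the shortest path suffix shared by all positive leaves
    and by no negative leaf, or None if no single suffix works."""
    pos_paths = [p for p, t in leaves if t in positives]
    neg_paths = [p for p, t in leaves if t not in positives]
    if not pos_paths:
        return None
    for k in range(1, min(len(p) for p in pos_paths) + 1):
        suffixes = {p[-k:] for p in pos_paths}
        if len(suffixes) != 1:
            return None  # positives already disagree; longer suffixes will too
        suffix = next(iter(suffixes))
        if all(p[-k:] != suffix for p in neg_paths):
            return suffix
    return None


def extract(leaves, suffix):
    """Apply a learned path delimiter: keep leaves whose path ends with it."""
    k = len(suffix)
    return [t for p, t in leaves if p[-k:] == suffix]


leaves = list(text_leaves(ET.fromstring(PAGE)))
title_rule = shortest_discriminative_suffix(leaves, {"Title A", "Title B"})
price_rule = shortest_discriminative_suffix(leaves, {"10.00", "12.50"})
print(title_rule, extract(leaves, title_rule))
print(price_rule, extract(leaves, price_rule))
```

In this toy page the tag `b` alone does not separate titles from the footer, so the sketch has to extend the delimiter one step up the tree to `td/b`, whereas `i` alone suffices for prices. This mirrors, in a very reduced form, the idea of using structural context around a text leaf to build an extraction rule.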

Boris Chidlovskii

Delimiters | IJCAI 2003 | IJCAI 2007 | Information Extraction | Tree Documents |

Post Info

Added	31 Oct 2010
Updated	31 Oct 2010
Type	Conference
Year	2003
Where	IJCAI
Authors	Boris Chidlovskii
