IJCAI
2003

Information Extraction from Tree Documents by Learning Subtree Delimiters

Information extraction has conventionally treated HTML pages as plain text documents extended with HTML tags. However, the growing maturity and correct usage of the HTML/XHTML formats open an opportunity to treat Web pages as trees, to mine the rich structural context in those trees, and to learn accurate extraction rules. In this paper, we generalize the notion of a delimiter, developed for string information extraction, to tree documents. By analogy with delimiters in strings, we define delimiters in tree documents as subtrees surrounding the text leaves. We formalize wrapper induction for tree documents as learning classification rules based on the subtree delimiters. We analyze a restricted case of subtree delimiters in the form of simple paths. We design an efficient data structure for storing candidate delimiters and an incremental algorithm for finding the most discriminative subtree delimiters for the wrapper.
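
To make the idea of a simple-path delimiter concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's data structure or incremental algorithm: the sample page, the function names text_leaves and shortest_discriminative_suffix, and the choice to represent a delimiter as the shortest tag-name path ending at a text leaf are all hypothetical. The sketch parses a small well-formed XHTML fragment, records the root-to-leaf tag path of every text leaf, and searches for the shortest path suffix that covers all labeled leaves while matching none of the others.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; any well-formed XHTML fragment would do.
PAGE = """<html><body>
<table>
<tr><td><b>Title A</b></td><td><i>10.00</i></td></tr>
<tr><td><b>Title B</b></td><td><i>12.50</i></td></tr>
</table>
<p><i>footer note</i></p>
</body></html>"""


def text_leaves(root):
    """Yield (path, text) for every element that directly holds text.

    The path is the tuple of tag names from the root down to the element.
    Tail text (text following a closing tag) is ignored in this sketch.
    """
    def walk(node, path):
        path = path + (node.tag,)
        text = (node.text or "").strip()
        if text:
            yield path, text
        for child in node:
            yield from walk(child, path)

    yield from walk(root, ())


def shortest_discriminative_suffix(leaves, positives):
    """Return the shortest path suffix matching all positive leaves only.

    leaves    -- list of (path, text) pairs
    positives -- set of leaf texts labeled as extraction targets
    Returns None when no single suffix separates the two groups.
    """
    pos_paths = [p for p, t in leaves if t in positives]
    neg_paths = [p for p, t in leaves if t not in positives]
    if not pos_paths:
        return None
    max_len = min(len(p) for p in pos_paths)
    for k in range(1, max_len + 1):
        suffix = pos_paths[0][-k:]
        if not all(p[-k:] == suffix for p in pos_paths):
            # A longer suffix cannot cover all positives either.
            return None
        if not any(p[-k:] == suffix for p in neg_paths):
            return suffix
    return None


leaves = list(text_leaves(ET.fromstring(PAGE)))
price_rule = shortest_discriminative_suffix(leaves, {"10.00", "12.50"})
title_rule = shortest_discriminative_suffix(leaves, {"Title A", "Title B"})
```

In this example the single tag i is not enough to isolate the prices, because the footer note is also wrapped in i; extending the path one step upward to (td, i) separates them. The titles, by contrast, are already separated by the one-step path (b,). A wrapper in the spirit of the abstract would consider richer subtree delimiters than these upward paths.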
Boris Chidlovskii
Added 31 Oct 2010
Updated 31 Oct 2010
Type Conference
Year 2003
Where IJCAI
Authors Boris Chidlovskii