SIGMOD
2009
ACM

Robust web extraction: an approach based on a probabilistic tree-edit model

On script-generated web sites, many documents share common HTML tree structure, allowing wrappers to effectively extract information of interest. Of course, the scripts and thus the tree structure evolve over time, causing wrappers to break repeatedly, and resulting in a high cost of maintaining wrappers. In this paper, we explore a novel approach: we use temporal snapshots of web pages to develop a tree-edit model of HTML, and use this model to improve wrapper construction. We view the changes to the tree structure as suppositions of a series of edit operations: deleting nodes, inserting nodes and substituting labels of nodes. The tree structures evolve by choosing these edit operations stochastically. Our model is attractive in that the probability that a source tree has evolved into a target tree can be estimated efficiently—in quadratic time in the size of the trees—making it a potentially useful tool for a variety of tree-evolution problems. We give an algorithm to learn the...
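As a rough illustration of the three edit operations named in the abstract (deleting nodes, inserting nodes, substituting labels), the sketch below simulates one round of random edits on a small labeled tree. It is not the authors' model or their quadratic-time probability computation: the `Node` class, the per-node probabilities `p_del`, `p_ins`, `p_sub`, the label vocabulary, and the choice to keep the root fixed are all assumptions made for this example.

```python
import random
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Node:
    """A labeled tree node, e.g. an HTML element (hypothetical helper type)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def evolve(node: Node, rng: random.Random, p_del: float = 0.1,
           p_ins: float = 0.1, p_sub: float = 0.1,
           labels: Sequence[str] = ("div", "span", "td")) -> List[Node]:
    """Return the nodes that stand in for `node` after one round of random edits."""
    # Evolve the children first; each child may become zero or more nodes.
    new_children: List[Node] = []
    for child in node.children:
        new_children.extend(evolve(child, rng, p_del, p_ins, p_sub, labels))

    # Delete: the node vanishes and its children are spliced into its parent.
    if rng.random() < p_del:
        return new_children

    # Substitute: replace the label with one drawn from the vocabulary.
    label = rng.choice(labels) if rng.random() < p_sub else node.label
    result = Node(label, new_children)

    # Insert: wrap the node in a freshly created parent.
    if rng.random() < p_ins:
        result = Node(rng.choice(labels), [result])
    return [result]


def evolve_tree(root: Node, rng: random.Random, **params) -> Node:
    """Evolve a whole tree, keeping the root itself fixed (an assumption)."""
    children: List[Node] = []
    for child in root.children:
        children.extend(evolve(child, rng, **params))
    return Node(root.label, children)


def size(node: Node) -> int:
    """Count the nodes in the tree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)


if __name__ == "__main__":
    page = Node("html", [Node("body", [
        Node("div", [Node("span")]),
        Node("table", [Node("tr", [Node("td")])]),
    ])])
    snapshot = evolve_tree(page, random.Random(0))
    print(size(page), size(snapshot))
```

Running `evolve_tree` repeatedly on its own output gives a sequence of synthetic snapshots, which is one way to see how a wrapper tied to a fixed path through the tree could break as the structure drifts.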
Nilesh N. Dalvi, Philip Bohannon, Fei Sha
Added 19 May 2010
Updated 19 May 2010
Type Conference
Year 2009
Where SIGMOD
Authors Nilesh N. Dalvi, Philip Bohannon, Fei Sha