Extracting data records from the web using tag path clustering



Fully automatic methods that extract lists of objects from the Web have been studied extensively. Record extraction, the first step of this object extraction process, identifies a set of Web page segments, each of which represents an individual object (e.g., a product). State-of-the-art methods suffice for simple search, but they often fail to handle more complicated or noisy Web page structures due to a key limitation – their greedy manner of identifying a list of records through pairwise comparison (i.e., similarity match) of consecutive segments. This paper introduces a new method for record extraction that captures a list of objects in a more robust way based on a holistic analysis of a Web page. The method focuses on how a distinct tag path appears repeatedly in the DOM tree of the Web document. Instead of comparing a pair of individual segments, it compares a pair of tag path occurrence patterns (called visual signals) to estimate how likely these two tag paths represent the...
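
The idea of a tag path occurrence pattern can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that a tag path is the slash-joined chain of open tags from the root, that positions are counted over start tags in document order, and that a visual signal can be modeled as a binary occurrence vector. The names TagPathCollector, tag_path_occurrences, and visual_signal are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Records, for each distinct tag path, the positions where it occurs."""

    def __init__(self):
        super().__init__()
        self.stack = []                       # currently open tags, root first
        self.position = 0                     # running index over start tags
        self.occurrences = defaultdict(list)  # tag path -> list of positions

    def handle_starttag(self, tag, attrs):
        path = "/".join(self.stack + [tag])
        self.occurrences[path].append(self.position)
        self.position += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_path_occurrences(html):
    """Map each tag path in `html` to the sorted positions where it appears."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.occurrences)


def visual_signal(positions, length):
    """Assumed model: a binary vector with 1 wherever the tag path occurs."""
    marked = set(positions)
    return [1 if i in marked else 0 for i in range(length)]


if __name__ == "__main__":
    page = "<ul><li><a>x</a></li><li><a>y</a></li></ul>"
    for path, positions in tag_path_occurrences(page).items():
        print(path, visual_signal(positions, 5))
```

In this sketch, paths such as ul/li and ul/li/a occur at regularly interleaved positions, which is the kind of repeated pattern a holistic comparison of tag paths could pick up without matching consecutive segments pairwise. How the paper actually scores the similarity of two signals is not stated in the abstract above, so no similarity measure is shown here.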

Gengxin Miao, Jun'ichi Tatemura, Wang-Pin Hsiung, Arsany Sawires, Louise E. Moser


Record Extraction | Tag Paths | Web Page | WWW 2009 |


Post Info

Added	23 Jul 2010
Updated	23 Jul 2010
Type	Conference
Year	2009
Where	WWW
Authors	Gengxin Miao, Jun'ichi Tatemura, Wang-Pin Hsiung, Arsany Sawires, Louise E. Moser
