Simultaneous record detection and attribute labeling in web data extraction


Recent work has shown the feasibility and promise of template-independent Web data extraction. However, existing approaches use decoupled strategies, attempting to do data record detection and attribute labeling in two separate phases. In this paper, we show that separately extracting data records and attributes is highly ineffective and propose a probabilistic model to perform these two tasks simultaneously. In our approach, record detection can benefit from the availability of semantics required in attribute labeling and, at the same time, the accuracy of attribute labeling can be improved when data records are labeled in a collective manner. The proposed model is called Hierarchical Conditional Random Fields. It can efficiently integrate all useful features by learning their importance, and it can also incorporate hierarchical interactions which are very important for Web data extraction. We empirically compare the proposed model with existing decoupled approaches for product infor...

Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, Wei-Ying Ma


Attribute Labeling | Data Mining | Data Record Detection | KDD 2006 | Web Data Extraction


» WebTables exploring the power of tables on the web

» Academic conference homepage understanding using constrained hierarchical conditional rand...

Post Info

Added	30 Nov 2009
Updated	30 Nov 2009
Type	Conference
Year	2006
Where	KDD
Authors	Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, Wei-Ying Ma
