Automatic Wrapper Generation Using Tree Matching and Partial Tree Alignment

This paper is concerned with the problem of structured data extraction from Web pages. The objective of the research is to automatically segment data records in a page, extract data items/fields from these records and store the extracted data in a database. In this paper, we first introduce the extraction problem, and then discuss the main existing approaches and their limitations. After that, we introduce a novel technique (called DEPTA) to automatically perform Web data extraction. The method consists of three steps: (1) identifying data records with similar patterns in a page, (2) aligning and extracting data items from the identified data records and (3) generating tree-based regular expressions to facilitate later extraction from other similar pages. The key innovation is the proposal of a new multiple tree alignment algorithm called partial tree alignment, which was found to be particularly suitable for Web data extraction. This paper is based on our work published in KDD-03 and...
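The abstract names tree matching as a building block for step (1) but does not say which matching algorithm is used. As a rough illustration only, the sketch below assumes the well-known Simple Tree Matching recurrence over tag trees. The `Node` class, the function names, and the normalized `similarity` score are hypothetical choices made for this example and are not taken from the paper.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A minimal tag-tree node (hypothetical stand-in for a parsed DOM node)."""
    tag: str
    children: List[Node] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum order-preserving matching.

    Assumption: this is the classic Simple Tree Matching recurrence, in which
    roots must carry the same tag and children are matched in order.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i children of a
    # and the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n] + 1  # +1 for the matched roots


def tree_size(t: Node) -> int:
    """Count the nodes in a tree."""
    return 1 + sum(tree_size(c) for c in t.children)


def similarity(a: Node, b: Node) -> float:
    """Matching size normalized by the average size of the two trees."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


# Two toy "data records": table rows with slightly different cells.
rec1 = Node("tr", [Node("td", [Node("img")]), Node("td", [Node("b")]), Node("td")])
rec2 = Node("tr", [Node("td", [Node("img")]), Node("td")])
print(simple_tree_matching(rec1, rec2), similarity(rec1, rec2))
```

Under this assumption, subtrees whose similarity exceeds some chosen threshold could be treated as candidate data records with similar patterns. The threshold and the way candidates are grouped are left open here, since the abstract does not describe them.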

Yanhong Zhai, Bing Liu

AAAI 2006 | Data Extraction | Data Records | Intelligent Agents | Web Data Extraction |

Added	30 Oct 2010
Updated	30 Oct 2010
Type	Conference
Year	2006
Where	AAAI
Authors	Yanhong Zhai, Bing Liu
