Rule Learning for Feature Values Extraction from HTML Product Information Sheets

15 years 12 months ago


The Web is now a huge information repository with a rich semantic structure, but that structure is addressed primarily to human understanding rather than to automated processing by a computer. Collecting product information from the Web and organizing it in a form suitable for automated machine processing is a primary task of software shopping agents and has received a lot of attention in recent years. In this paper we assume that product information is represented as a set of feature-value pairs contained in an HTML product information sheet, which is usually formatted using HTML tables. The paper presents a technique for learning rules that extract product information from such sheets. The technique exploits the fact that the Web pages representing the product information of a given producer are generated on the fly from the producer's database and therefore exhibit uniform structures. Consequently, while the extraction task is executed manually...
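To make the feature-value representation concrete, the following is a minimal sketch of reading such pairs from a two-column HTML table. It is not the rule-learning technique of the paper: the extraction rule here is written by hand, and the names `FeatureValueParser`, `extract_pairs`, and `SAMPLE_SHEET`, as well as the sample markup, are hypothetical and chosen only for illustration. It assumes each feature occupies one table row with exactly two cells.

```python
from html.parser import HTMLParser


class FeatureValueParser(HTMLParser):
    """Collect (feature, value) pairs from table rows that have two cells.

    Hand-written illustration only; the paper learns extraction rules
    instead of hard-coding them.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._row = None   # cell texts of the current <tr>, or None outside a row
        self._cell = None  # text fragments of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments split by inline tags and normalize whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Assumption: a feature row has exactly two cells; other rows
            # (for example a one-cell heading row) are skipped.
            if len(self._row) == 2:
                self.pairs.append((self._row[0], self._row[1]))
            self._row = None
            self._cell = None


def extract_pairs(html):
    """Return the list of (feature, value) pairs found in the given HTML."""
    parser = FeatureValueParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Hypothetical product information sheet, invented for this example.
SAMPLE_SHEET = """
<table>
  <tr><th colspan="2">Hypothetical product</th></tr>
  <tr><td>Weight</td><td>1.2 kg</td></tr>
  <tr><td>Color</td><td><b>Dark</b> blue</td></tr>
</table>
"""

pairs = extract_pairs(SAMPLE_SHEET)
print(pairs)
```

Because sheets from one producer share a uniform structure, a rule of this kind that works on one page would be expected to work on the others; the paper's contribution is learning such rules rather than writing them manually.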

Costin Badica, Amelia Badica
