From Layout to Semantic: a Reranking Model for Mapping Web Documents to Mediated XML Representations



Many documents on the Web are formatted in a weakly structured format. Because of their weak semantics and the heterogeneity of their formats, the information conveyed by their structure cannot be directly exploited. We consider here the conversion of such documents into a predefined mediated semi-structured format that is more amenable to automatic processing of the document content. We develop a machine learning approach to this conversion problem in which the transformation is learned automatically from a set of example documents manually transformed into the target structure. Our method proceeds in three steps. Given an input document, document elements are first annotated with labels of the target schema. Structured candidate documents are then generated using a generalized probabilistic context-free parsing algorithm. Finally, candidates are reranked using a perceptron-like ranking algorithm. Experiments performed on two different datasets show that the proposed met...
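The third step of the abstract, reranking candidates with a perceptron-like algorithm, can be illustrated with a small sketch. This is not the authors' implementation: it is a generic structured-perceptron reranker, and it assumes the first two steps (labeling and candidate generation) have already produced a list of candidate documents, each summarized as a feature dictionary. The feature names (`title_first`, `nesting_errors`) and the function names are hypothetical, chosen only for illustration.

```python
# Illustrative perceptron-style reranker (an assumption-laden sketch,
# not the method from the paper). Each candidate is a dict of features.

def score(weights, features):
    """Linear score: dot product of the weight vector and a feature dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rerank(weights, candidates):
    """Return the candidates sorted from highest to lowest score."""
    return sorted(candidates, key=lambda c: score(weights, c), reverse=True)


def train_reranker(training_sets, epochs=5):
    """Perceptron update: when the top-scored candidate is not the gold one,
    move the weights toward the gold features and away from the predicted ones.

    training_sets is a list of (candidates, gold_index) pairs.
    """
    weights = {}
    for _ in range(epochs):
        for candidates, gold_index in training_sets:
            predicted = max(range(len(candidates)),
                            key=lambda i: score(weights, candidates[i]))
            if predicted == gold_index:
                continue
            for name, value in candidates[gold_index].items():
                weights[name] = weights.get(name, 0.0) + value
            for name, value in candidates[predicted].items():
                weights[name] = weights.get(name, 0.0) - value
    return weights


# Toy usage with hypothetical features: the second candidate is the gold one.
candidates = [
    {"title_first": 0.0, "nesting_errors": 2.0},
    {"title_first": 1.0, "nesting_errors": 0.0},
]
weights = train_reranker([(candidates, 1)])
best = rerank(weights, candidates)[0]
```

In this toy run the single mistaken prediction pushes the weight of `nesting_errors` below zero, so the gold candidate is ranked first afterwards. A real system would draw its features from the parse and the target schema rather than from hand-written values.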

Guillaume Wisniewski, Patrick Gallinari


Documents | Information Technology | Many Documents | RIAO 2007 | Weakly Structured Format |


Added	30 Oct 2010
Updated	30 Oct 2010
Type	Conference
Year	2007
Where	RIAO
Authors	Guillaume Wisniewski, Patrick Gallinari
