

Learning to Extract Text-Based Information from the World Wide Web

There is a wealth of information to be mined from narrative text on the World Wide Web. Unfortunately, standard natural language processing (NLP) extraction techniques expect full, grammatical sentences, and perform poorly on the choppy sentence fragments that are often found on web pages. This paper introduces Webfoot, a preprocessor that parses web pages into logically coherent segments based on page layout cues. Output from Webfoot is then passed on to CRYSTAL, an NLP system that learns text extraction rules from example. Webfoot and CRYSTAL transform the text into a formal representation that is equivalent to relational database entries. This is a necessary first step for knowledge discovery and other automated analysis of free text.

Information Extraction from the Web

The World Wide Web contains a wealth of text information in the form of free text. Until a text extraction system transforms it into an unambiguous format, much of this information remains inaccessible to automated knowledge discovery tech...
Stephen Soderland
Added: 08 Aug 2010
Updated: 08 Aug 2010
Type: Conference
Year: 1997
Where: KDD
Authors: Stephen Soderland