Learning to Extract Text-Based Information from the World Wide Web

15 years 11 months ago

Download www.aaai.org

There is a wealth of information to be mined from narrative text on the World Wide Web. Unfortunately, standard natural language processing (NLP) extraction techniques expect full, grammatical sentences, and perform poorly on the choppy sentence fragments that are often found on web pages. This paper introduces Webfoot, a preprocessor that parses web pages into logically coherent segments based on page layout cues. Output from Webfoot is then passed on to CRYSTAL, an NLP system that learns text extraction rules from example. Webfoot and CRYSTAL transform the text into a formal representation that is equivalent to relational database entries. This is a necessary first step for knowledge discovery and other automated analysis of free text.

Information Extraction from the Web

The World Wide Web contains a wealth of text information in the form of free text. Until a text extraction system transforms it into an unambiguous format, much of this information remains inaccessible to automated knowledge discovery tech...

Stephen Soderland
