Expressing web page content in a way that computers can understand is the key to a semantic web. Automatically generating ontological information from the web using machine learning shows great promise toward this goal. We present LASSO, an architecture that combines distributed components for training web page classifiers via machine learning and information extraction and then labeling new pages with those classifiers. LASSO's results are semantic models of web pages stored in a database back end, and the models are defined with respect to whatever ontology the user chooses. LASSO can be used to build a wide variety of applications or as a collaborative experimentation workbench. As part of our proof-of-concept prototype, we present an enhanced ontological search engine application. We also describe how LASSO can be used to compare machine learning algorithms, and we analyze our system with a code reuse metric.
Christopher N. Hammack, Stephen D. Scott