Clustering Template Based Web Documents

15 years 8 months ago

Download www.informatik.uni-mainz.de

More and more documents on the World Wide Web are based on templates. On a technical level, this gives those documents very similar source code and DOM tree structures. Grouping together documents that are based on the same template is an important task for applications that analyse the template structure and need clean training data. This paper develops and compares several distance measures for clustering web documents according to their underlying templates. By combining those distance measures with different clustering approaches, we show which combination of methods leads to the desired result.

As more and more documents on the World Wide Web are generated automatically by Content Management Systems (CMS), more and more of them are based on templates. Templates can be seen as framework documents which are filled with different contents to compile the final documents. They are a standard (if not essential) CMS technology. Templates provide the managed web sites wit...
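To make the general idea concrete, the following is a minimal sketch of template-oriented clustering, not one of the distance measures developed in the paper. It assumes, purely for illustration, that structural similarity can be approximated by the Jaccard distance between the sets of root-to-element tag paths of two documents, and that documents are grouped by single-linkage clustering with a fixed threshold. The names `tag_paths`, `path_distance`, `single_link_clusters` and the threshold value are all hypothetical.

```python
from html.parser import HTMLParser
from itertools import combinations


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths of an HTML document."""

    # Elements without a closing tag; they are recorded but never pushed.
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths found in an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    return collector.paths


def path_distance(a, b):
    """Jaccard distance between two sets of tag paths (0.0 means identical)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def single_link_clusters(docs, threshold):
    """Single-linkage clustering: link any two documents whose distance
    is at most `threshold`, then return the connected groups as lists
    of document indices (assumed threshold, chosen by the caller)."""
    paths = [tag_paths(d) for d in docs]
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if path_distance(paths[i], paths[j]) <= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Two pages filled into the same frame share almost all of their tag paths even when the text differs, so their distance stays near zero, while pages built from a different frame share only the outermost elements. In practice the threshold and the choice of distance would need to be tuned on real data.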

Thomas Gottron
