CLEIEJ
2008

Measuring Contribution of HTML Features in Web Document Clustering

Documents in HTML format offer many features to analyze, from the terms in special sections to the phrases that appear throughout the document. It is therefore important to determine which features contribute the most to separating documents by class. With this information, a feature that is expensive to compute and contributes little to the clustering process can be left out of the document representation. Using a novel representation model and the standard k-means algorithm, we found that terms in the body of the document contribute the most, followed by terms in other sections. The suffix tree contributes poorly in that scenario, while term order graphs have only a small influence on the partition. We used four known datasets to support these conclusions.
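
The idea of scoring one HTML feature at a time can be illustrated with a minimal sketch. It is not the authors' representation model, which the abstract does not describe. It assumes a plain bag-of-words vector per section, a basic k-means loop, and cluster purity as the quality measure. The `DOCS` data, the section names, and the helper names (`vectorize`, `kmeans`, `purity`, `section_scores`) are all hypothetical.

```python
import math
import random
from collections import Counter

# Hypothetical toy corpus: terms per HTML section plus a known class label.
DOCS = [
    {"title": ["home"], "body": ["goal", "match", "team"], "label": "sport"},
    {"title": ["news"], "body": ["team", "goal", "league"], "label": "sport"},
    {"title": ["home"], "body": ["stock", "market", "bank"], "label": "finance"},
    {"title": ["news"], "body": ["bank", "stock", "rate"], "label": "finance"},
]

def vectorize(docs, section, vocab):
    """Build an L2-normalised term-frequency vector for one section of each document."""
    vectors = []
    for doc in docs:
        counts = Counter(doc[section])
        vec = [float(counts[term]) for term in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid dividing by zero
        vectors.append([x / norm for x in vec])
    return vectors

def kmeans(points, k, iters=20, seed=0):
    """Basic k-means (Lloyd's iterations) with squared Euclidean distance."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def purity(assignment, labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    total = 0
    for cluster in set(assignment):
        member_labels = [l for a, l in zip(assignment, labels) if a == cluster]
        total += Counter(member_labels).most_common(1)[0][1]
    return total / len(labels)

def section_scores(docs, k, sections=("title", "body")):
    """Cluster on each section alone and report the purity it achieves."""
    labels = [doc["label"] for doc in docs]
    scores = {}
    for section in sections:
        vocab = sorted({term for doc in docs for term in doc[section]})
        vectors = vectorize(docs, section, vocab)
        scores[section] = purity(kmeans(vectors, k), labels)
    return scores

if __name__ == "__main__":
    for section, score in section_scores(DOCS, k=2).items():
        print(section, round(score, 2))
```

A section whose purity is close to that of a random partition would be a candidate for removal from the representation, especially if it is costly to extract. Other quality measures could be substituted for purity without changing the structure of the comparison.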

Esteban Meneses, Oldemar Rodríguez-Rojas

Post Info

Added	09 Dec 2010
Updated	09 Dec 2010
Type	Journal
Year	2008
Where	CLEIEJ
Authors	Esteban Meneses, Oldemar Rodríguez-Rojas
