Evolving Better Stoplists for Document Clustering and Web Intelligence

Text classification, document clustering and similar document analysis tasks are currently the subject of significant global research, since such areas underpin web intelligence, web mining, search engine design, and so forth. A fundamental tool in such tasks is a list of so-called ‘stop’ words, called a ‘stoplist’. A stoplist is a specific collection of ‘noise’ words, which tend to appear frequently in documents but are believed to carry no usable information that would aid learning tasks; the words in the stoplist are therefore removed from the documents concerned before processing begins. It is well known that the results of document classification experiments (for example) are invariably considerably improved when a stoplist is employed. Current stoplists in regular use are, however, rather outdated. We have explored this claim in recent work which produced new stoplists based on word-entropy over modern collections of docum...
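The two ideas in the abstract, ranking words by entropy and removing stoplist words before processing, can be illustrated with a small Python sketch. The excerpt does not give the authors' exact entropy formulation, so this sketch assumes one common definition: the Shannon entropy of a word's occurrence distribution across documents, with the highest-entropy (most evenly spread) words taken as stoplist candidates. The function names `word_entropy`, `build_stoplist` and `remove_stopwords`, and the whitespace tokenisation, are hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def word_entropy(docs):
    """Map each word to the Shannon entropy (in bits) of its
    occurrence distribution over the documents.

    Assumption: this is one common definition of word entropy,
    not necessarily the one used by the paper's authors.
    """
    # Naive tokenisation: lowercase and split on whitespace.
    per_doc = [Counter(d.lower().split()) for d in docs]

    # Total occurrences of each word over the whole collection.
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)

    entropy = {}
    for word, total in totals.items():
        h = 0.0
        for counts in per_doc:
            if counts[word]:
                p = counts[word] / total  # share of occurrences in this doc
                h -= p * math.log2(p)
        entropy[word] = h
    return entropy


def build_stoplist(docs, size):
    """Return the `size` highest-entropy words as a set.

    Ties are broken alphabetically so the result is deterministic.
    """
    ent = word_entropy(docs)
    ranked = sorted(ent, key=lambda w: (-ent[w], w))
    return set(ranked[:size])


def remove_stopwords(text, stoplist):
    """Tokenise `text` and drop every token found in `stoplist`."""
    return [t for t in text.lower().split() if t not in stoplist]


docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "the bird saw the worm",
]
stoplist = build_stoplist(docs, size=1)
filtered = remove_stopwords("the cat sat", stoplist)
```

In this toy collection only "the" is spread across all three documents, so it receives the highest entropy and is the sole entry in the stoplist; every other word occurs in a single document and has zero entropy. A real collection would need proper tokenisation and a much larger stoplist size.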

Mark P. Sinka, David Corne

Document Analysis Tasks | HIS 2003 | HIS 2007 | Similar Document Analysis | Stoplist |

Post Info

Added	31 Oct 2010
Updated	31 Oct 2010
Type	Conference
Year	2003
Where	HIS
Authors	Mark P. Sinka, David Corne
