Learning to classify short and sparse text & web with hidden topics from large-scale data collections



This paper presents a general framework for building classifiers that deal with short and sparse text & Web segments by making the most of hidden topics discovered from large-scale data collections. The main motivation of this work is that many classification tasks working with short segments of text & Web, such as search snippets, forum & chat messages, blog & news feeds, product reviews, and book & movie summaries, fail to achieve high accuracy due to data sparseness. We therefore propose drawing on external knowledge to make the data more related and to expand the coverage of classifiers so that they handle future data better. The underlying idea of the framework is that, for each classification task, we collect a large-scale external data collection called a "universal dataset", and then build a classifier on both a (small) set of labeled training data and a rich set of hidden topics discovered from that data collection. The framework is gene...
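
The pipeline described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the choice of LDA with collapsed Gibbs sampling as the topic model, the function names `train_topics`, `infer_topics`, and `augment`, the `topic:k` token format, the 0.3 threshold, and the toy "universal dataset". It only shows the general shape of the idea: learn topics from external text, infer a topic distribution for a short snippet, and append the dominant topics as extra features.

```python
import random
from collections import Counter

def train_topics(docs, num_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized docs (illustrative, not optimized)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # tokens assigned to each topic
    doc_topic = [[0] * num_topics for _ in docs]         # topic counts per document
    assign = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, assign[d]):
            topic_word[k][w] += 1
            topic_total[k] += 1
            doc_topic[d][k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the current assignment before resampling this token.
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                doc_topic[d][k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                topic_word[k][w] += 1
                topic_total[k] += 1
                doc_topic[d][k] += 1
    return topic_word, topic_total, vocab_size

def infer_topics(doc, model, num_topics, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Estimate the topic distribution of a new short text, keeping the trained topics fixed."""
    topic_word, topic_total, vocab_size = model
    rng = random.Random(seed)
    zs = [rng.randrange(num_topics) for _ in doc]
    counts = [zs.count(t) for t in range(num_topics)]
    for _ in range(iters):
        for i, w in enumerate(doc):
            counts[zs[i]] -= 1
            weights = [
                (counts[t] + alpha) * (topic_word[t][w] + beta)
                / (topic_total[t] + vocab_size * beta)
                for t in range(num_topics)
            ]
            zs[i] = rng.choices(range(num_topics), weights=weights)[0]
            counts[zs[i]] += 1
    total = len(doc) + num_topics * alpha
    return [(c + alpha) / total for c in counts]

def augment(doc, theta, threshold=0.3):
    """Append a hypothetical 'topic:k' token for every topic whose probability passes the threshold."""
    return doc + [f"topic:{k}" for k, p in enumerate(theta) if p >= threshold]

# Toy stand-in for the large external "universal dataset" (assumed data).
universal = [
    "stock market shares trading investors".split(),
    "bank interest rates market economy".split(),
    "football match goal team league".split(),
    "team coach player season match".split(),
]
NUM_TOPICS = 2
model = train_topics(universal, NUM_TOPICS)

snippet = "market rates".split()                 # a short, sparse text segment
theta = infer_topics(snippet, model, NUM_TOPICS)  # topic distribution for the snippet
augmented = augment(snippet, theta)               # original words plus topic tokens
print(augmented)
```

The augmented token lists (for both the small labeled training set and future snippets) could then be passed to any standard text classifier. The intent is that two snippets sharing no words can still share topic tokens, which is one way to read the abstract's claim about making sparse data "more related".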

Xuan Hieu Phan, Minh Le Nguyen, Susumu Horiguchi


Internet Technology | Keywords Web Data | Large-scale External Data | Large-scale Data Collections | WWW 2008


» Semantic Smoothing for Bayesian Text Classification with Small Training Data

» A class-feature-centroid classifier for text categorization

» Automatic web query classification using labeled and unlabeled training data

Post Info

Added	21 Nov 2009
Updated	21 Nov 2009
Type	Conference
Year	2008
Where	WWW
Authors	Xuan Hieu Phan, Minh Le Nguyen, Susumu Horiguchi
