Can Chinese web pages be classified with English data source?



As the World Wide Web in China grows rapidly, mining knowledge from Chinese Web pages is becoming increasingly important. Mining Web information usually relies on machine learning techniques, which require a large amount of labeled data to train credible models. Although the number of Chinese Web pages is increasing quickly, labeled Chinese data remain scarce. Labeled English Web pages, however, are relatively plentiful. These labeled data, though in a different linguistic representation, share a substantial amount of semantic information with Chinese ones and can be used to help classify Chinese Web pages. In this paper, we propose an information bottleneck based approach to address this cross-language classification problem. Our algorithm first translates all the Chinese Web pages into English. Then, all the Web pages, both Chinese and English, are encoded through an information bottleneck that allows only limited information to pass. Therefore, in order to retai...
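The abstract is cut off before it states the objective, so the following is only a minimal sketch of the general information bottleneck idea, not the authors' algorithm. It assumes pages are represented as page-word counts and that the "bottleneck" is a hard assignment of pages to a small number of clusters; the function names (`mutual_information`, `merge_rows`, `normalize`) and the toy counts are hypothetical. The sketch measures how much mutual information between pages and words is lost when pages are compressed into clusters, which is the quantity a bottleneck-style method would try to keep small.

```python
import math


def mutual_information(joint):
    """Return I(X;Y) in bits for a joint distribution given as a list of rows."""
    row_sums = [sum(row) for row in joint]
    col_sums = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # zero cells contribute nothing
                mi += p * math.log2(p / (row_sums[i] * col_sums[j]))
    return mi


def merge_rows(joint, assignment, n_clusters):
    """Sum the rows of `joint` that share a cluster id (the compression step)."""
    n_cols = len(joint[0])
    merged = [[0.0] * n_cols for _ in range(n_clusters)]
    for row, cluster in zip(joint, assignment):
        for j, p in enumerate(row):
            merged[cluster][j] += p
    return merged


def normalize(counts):
    """Turn a count matrix into a joint probability distribution."""
    total = sum(sum(row) for row in counts)
    return [[value / total for value in row] for row in counts]


# Hypothetical toy data: rows are pages, columns are words.
# Pages 0-1 use the first two words, pages 2-3 use the last two.
counts = [
    [4, 1, 0, 0],
    [3, 2, 0, 0],
    [0, 0, 2, 3],
    [0, 0, 1, 4],
]
joint = normalize(counts)

# Two candidate ways of squeezing four pages into two clusters.
good = merge_rows(joint, [0, 0, 1, 1], 2)  # groups pages with similar words
bad = merge_rows(joint, [0, 1, 0, 1], 2)   # mixes dissimilar pages

loss_good = mutual_information(joint) - mutual_information(good)
loss_bad = mutual_information(joint) - mutual_information(bad)
print(f"information lost (good clustering): {loss_good:.4f} bits")
print(f"information lost (bad clustering):  {loss_bad:.4f} bits")
```

In this toy setting, the clustering that groups pages with similar vocabulary loses less information about the words than the one that mixes them, which is the intuition behind forcing data through a limited-capacity encoding.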

Xiao Ling, Gui-Rong Xue, Wenyuan Dai, Yun Jiang, Qiang Yang, Yong Yu


Chinese Labeled Data | Chinese Web Pages | English Web Pages | Internet Technology | WWW 2008 |


» Automatic Acquisition of Chinese-English Parallel Corpus from the Web

» A Discriminative Latent Variable-Based DE Classifier for Chinese-English SMT

» Bilingual web page and site readability assessment

» Evaluating Utility of Data Sources in a Large Parallel Czech-English Corpus CzEng 0.9

» A Lightweight and Efficient Tool for Cleaning Web Pages

» Mining Parenthetical Translations from the Web by Word Alignment

» Curate a transliteration corpus from transliterationtranslation pairs

» Cross-lingual query classification: a preliminary study

Post Info

Added	21 Nov 2009
Updated	21 Nov 2009
Type	Conference
Year	2008
Where	WWW
Authors	Xiao Ling, Gui-Rong Xue, Wenyuan Dai, Yun Jiang, Qiang Yang, Yong Yu
