

Word Segmentation and Recognition for Web Document Framework

It is observed that a better approach to understanding Web information is to base it on the document framework, which mainly consists of (i) the title and the URL name of the page, (ii) the titles and the URL names of the Web pages that it points to, (iii) the alternative information source for the embedded Web objects, and (iv) its linkage to other Web pages of the same document. Investigation reveals that a high percentage of the words inside the document framework are "compound words" that cannot be resolved with ordinary dictionaries. They may be abbreviations, acronyms, or concatenations of several (partial) words. To recover the content hierarchy of Web documents, we propose a new word segmentation and recognition mechanism to understand the information derived from the Web document framework. A maximal bi-directional matching algorithm with heuristic rules is used to resolve ambiguous segmentation and meaning in compound words. An adaptive training process is furthe...
Chi-Hung Chi, Chen Ding, Andrew Lim
Added 03 Aug 2010
Updated 03 Aug 2010
Type Conference
Year 1999
Where CIKM
Authors Chi-Hung Chi, Chen Ding, Andrew Lim
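
The abstract mentions a maximal bi-directional matching algorithm for splitting compound words such as those found in URL names. The sketch below illustrates the general textbook form of that technique, not the authors' implementation: it runs a greedy longest-match pass from the left and another from the right, then keeps the result with fewer out-of-dictionary pieces, breaking ties by fewer pieces. The function names, the sample vocabulary, and the tie-breaking heuristic are all assumptions made for illustration; the paper's own heuristic rules and its adaptive training step are not described in the abstract.

```python
def forward_max_match(text, vocab, max_len):
    """Greedy longest-match segmentation scanning left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, vocab, max_len):
    """Greedy longest-match segmentation scanning right to left."""
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def segment(text, vocab):
    """Run both passes and keep the one that looks more plausible.

    Assumed heuristic (not taken from the paper): prefer fewer
    out-of-dictionary pieces, then fewer pieces overall.
    """
    max_len = max(map(len, vocab), default=1)
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)

    def cost(tokens):
        unknown = sum(1 for t in tokens if t not in vocab)
        return (unknown, len(tokens))

    return min((fwd, bwd), key=cost)


# Hypothetical vocabulary, chosen so that the two passes disagree.
vocab = {"web", "webs", "store"}
print(segment("webstore", vocab))  # ['web', 'store']
```

With this vocabulary the forward pass greedily takes "webs" and is left with four unmatched single letters, while the backward pass finds "store" first and then "web", so the heuristic selects the backward result. When both passes score equally, the sketch simply returns the forward result; a real system would need further rules to settle such cases.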