Eliminating Useless Parts in Semi-structured Documents Using Alternation Counts



We propose a preprocessing method for Web mining which, given semi-structured documents that share the same structure and style, distinguishes useless parts from non-useless parts in each document without any prior knowledge of the documents. It is based on the simple idea that any n-gram is useless if it appears frequently. To decide an appropriate pair of length n and frequency a, we introduce a new statistical measure, the alternation count: the number of alternations between useless parts and non-useless parts. Given news articles written in English or Japanese, mixed with some non-articles, the algorithm eliminates the frequent n-grams used for the structure and style of articles and extracts the news contents and headlines with more than 97% accuracy if the articles are collected from the same site. Even if the input articles are collected from different sites, the algorithm extracts the contents of articles from these sites with at least 95% accuracy. Thus, the algorithm does not depend on the language, is robust fo...
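The core idea in the abstract (frequent n-grams are useless, and the alternation count measures how often a document switches between useless and non-useless parts) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names are hypothetical, and several details are assumptions the abstract does not state: n-grams are taken over characters, "frequency" is read as the number of documents containing the n-gram, an n-gram is treated as frequent when that count is at least a, and the rule the paper uses to pick the pair (n, a) from the alternation count is deliberately left out.

```python
from collections import Counter


def ngram_counts(docs, n):
    """Count, for each character n-gram, how many documents contain it.

    Assumption: "frequency" means document frequency.
    """
    counts = Counter()
    for doc in docs:
        counts.update({doc[i:i + n] for i in range(len(doc) - n + 1)})
    return counts


def useless_mask(doc, counts, n, a):
    """Mark every position covered by an n-gram whose count is >= a.

    Assumption: the threshold comparison is ">=".
    """
    mask = [False] * len(doc)
    for i in range(len(doc) - n + 1):
        if counts[doc[i:i + n]] >= a:
            for j in range(i, i + n):
                mask[j] = True
    return mask


def alternation_count(mask):
    """Number of switches between useless and non-useless positions."""
    return sum(1 for x, y in zip(mask, mask[1:]) if x != y)


def extract(doc, mask):
    """Keep only the characters not marked as useless."""
    return "".join(ch for ch, useless in zip(doc, mask) if not useless)


# Toy input: three documents sharing the same markup (made-up data).
docs = [
    "<h1>Rain hits Tokyo</h1>",
    "<h1>Stocks fall</h1>",
    "<h1>New park opens</h1>",
]
counts = ngram_counts(docs, n=4)
mask = useless_mask(docs[0], counts, n=4, a=3)
print(alternation_count(mask), extract(docs[0], mask))
```

On this toy input the shared markup is the only material that occurs in all three documents, so the mask switches twice (markup to text, text to markup) and the extracted string is the headline text. Evaluating `alternation_count` over a grid of candidate (n, a) pairs would be the natural next step, but how the paper turns those counts into a choice is not described in the abstract.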

Daisuke Ikeda, Yasuhiro Yamada, Sachio Hirokawa


DIS 2001 | Distinguishes Useless Parts | Non-useless Parts | Theoretical Computer Science | Useless Parts |


Post Info

Added	28 Jul 2010
Updated	28 Jul 2010
Type	Conference
Year	2001
Where	DIS
Authors	Daisuke Ikeda, Yasuhiro Yamada, Sachio Hirokawa
