Extracting Author Meta-Data from Web Using Visual Features

16 years 1 month ago


Enriching a digital library’s author meta-data can lead to valuable services and applications. This paper addresses the problem of extracting authors’ information from their homepages, which is essentially a multiclass classification problem. A homepage can be treated as a group of information pieces that need to be classified into different fields, e.g., Name, Title, Affiliation, and Email. In this problem, each information piece can be viewed as a point in a feature space, and certain patterns can also be observed among the different fields on a page. To improve extraction accuracy, this paper argues that the visual features of information pieces on a homepage should be fully utilized. In addition, this paper proposes an inter-fields probability model to capture the relations among different fields. This model can be combined with feature-space-based classification. Experimental results demonstrate that utilizing visual features and applying the inter...
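The abstract does not spell out how the inter-fields model is combined with the feature-space classifier, so the following is only a minimal sketch of one possible reading, not the authors' method. It assumes a first-order chain over the pieces in page order and uses Viterbi decoding (a swapped-in, standard technique) to maximize the product of per-piece classifier probabilities and field-to-field probabilities. The function names, the field list, and every number below are hypothetical and chosen for illustration.

```python
import math

# Hypothetical field set, taken from the examples named in the abstract.
FIELDS = ["Name", "Title", "Affiliation", "Email"]


def _log(p):
    # Guard against log(0) for probabilities that are exactly zero.
    return math.log(max(p, 1e-12))


def independent_labels(piece_probs):
    """Label each piece on its own: argmax of the classifier probabilities."""
    return [max(probs, key=probs.get) for probs in piece_probs]


def joint_labels(piece_probs, transition, fields=FIELDS):
    """Viterbi decoding over a first-order chain (an assumption of this sketch).

    Maximizes sum_i log P(field_i | piece_i) + sum_i log P(field_i | field_{i-1}).
    """
    if not piece_probs:
        return []
    # best[f] = best log-score of any labeling of the pieces so far ending in f
    best = {f: _log(piece_probs[0][f]) for f in fields}
    backpointers = []
    for probs in piece_probs[1:]:
        scores, pointers = {}, {}
        for f in fields:
            prev = max(fields, key=lambda p: best[p] + _log(transition[p][f]))
            scores[f] = best[prev] + _log(transition[prev][f]) + _log(probs[f])
            pointers[f] = prev
        best = scores
        backpointers.append(pointers)
    # Trace the best path backwards from the best final field.
    path = [max(fields, key=lambda f: best[f])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical classifier output for three information pieces, in page order.
piece_probs = [
    {"Name": 0.70, "Title": 0.10, "Affiliation": 0.10, "Email": 0.10},
    {"Name": 0.05, "Title": 0.40, "Affiliation": 0.45, "Email": 0.10},
    {"Name": 0.05, "Title": 0.05, "Affiliation": 0.50, "Email": 0.40},
]

# Hypothetical inter-field probabilities: transition[prev][next].
transition = {
    "Name":        {"Name": 0.05, "Title": 0.60, "Affiliation": 0.30, "Email": 0.05},
    "Title":       {"Name": 0.05, "Title": 0.05, "Affiliation": 0.80, "Email": 0.10},
    "Affiliation": {"Name": 0.05, "Title": 0.05, "Affiliation": 0.20, "Email": 0.70},
    "Email":       {"Name": 0.25, "Title": 0.25, "Affiliation": 0.25, "Email": 0.25},
}

print(independent_labels(piece_probs))
print(joint_labels(piece_probs, transition))
```

With these made-up numbers, labeling each piece independently assigns Affiliation to the second piece, while the joint decoding prefers Title there because a Name followed by a Title followed by an Affiliation is the more plausible ordering under the assumed inter-field probabilities.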

Shuyi Zheng, Ding Zhou, Jia Li, C. Lee Giles
