In this paper, we propose a new system extracting potentially copyright infringement texts from the Web, called EPCI. EPCI extracts them in the following way: (1) generating a set...
Takashi Tashiro, Takanori Ueda, Taisuke Hori, Yu H...
Web pages contain a combination of unique content and template material, which is present across multiple pages and used primarily for formatting, navigation, and branding. We stu...
Photo community sites such as Flickr and Picasa Web Album host a massive amount of personal photos with millions of new photos uploaded every month. These photos constitute an ove...
Liangliang Cao, Jie Yu, Jiebo Luo, Thomas S. Huang
Freshness has been increasingly realized by commercial search engines as an important criteria for measuring the quality of search results. However, most information retrieval met...
Accurate web page classification often depends crucially on information gained from neighboring pages in the local web graph. Prior work has exploited the class labels of nearby p...