EPCI: extracting potentially copyright infringement texts from the web



In this paper, we propose EPCI, a new system for extracting potentially copyright-infringing texts from the Web. EPCI works in four steps: (1) generating a set of queries from a given copyrighted seed text, (2) submitting every query to a search engine API, (3) gathering the result Web pages for each query in rank order, starting from the top, until the similarity between the seed text and the result pages falls below a given threshold, and (4) merging all the gathered pages and reranking them by their similarity to the seed text. Our experiment using 40 seed texts shows that EPCI extracts an average of 132 potentially copyright-infringing Web pages per seed text, with 94% precision on average.

Categories and Subject Descriptors: H.3.3 [INFORMATION STORAGE AND RETRIEVAL]: Information Search and Retrieval

General Terms: Experimentation
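The four steps in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the abstract does not say how queries are built or how similarity is measured, so the fixed-size phrase queries, the word n-gram Jaccard similarity, the default threshold, and the `search` callable (a stand-in for a search engine API that returns `(url, page_text)` pairs in rank order) are all assumptions.

```python
from typing import Callable, Dict, List, Set, Tuple


def word_ngrams(text: str, n: int = 3) -> Set[tuple]:
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity(seed: str, page: str, n: int = 3) -> float:
    """Jaccard overlap of word n-grams (an assumed measure, not the paper's)."""
    a, b = word_ngrams(seed, n), word_ngrams(page, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def generate_queries(seed: str, size: int = 4, step: int = 4) -> List[str]:
    """Step 1: cut the seed text into quoted phrase queries (assumed strategy)."""
    words = seed.split()
    last_start = max(len(words) - size + 1, 1)
    return ['"' + " ".join(words[i:i + size]) + '"'
            for i in range(0, last_start, step)]


def extract_candidates(
    seed: str,
    search: Callable[[str], List[Tuple[str, str]]],  # hypothetical search API
    threshold: float = 0.3,                          # assumed value
) -> List[Tuple[str, float]]:
    """Run steps 2 to 4 and return (url, similarity) pairs, best first."""
    best: Dict[str, float] = {}
    for query in generate_queries(seed):
        # Step 2: submit the query; results are assumed to arrive in rank order.
        for url, text in search(query):
            score = similarity(seed, text)
            # Step 3: stop reading this result list once similarity drops.
            if score < threshold:
                break
            best[url] = max(score, best.get(url, 0.0))
    # Step 4: merge results across queries and rerank by similarity.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

Passing `search` in as a parameter keeps the sketch independent of any particular search engine and makes it easy to try with a stubbed result list.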



Copyright Infringement Texts | Copyright Infringement Web | Internet Technology | STORAGE AND RETRIEVAL | WWW 2007


Added	22 Nov 2009
Updated	22 Nov 2009
Type	Conference
Year	2007
Where	WWW
Authors	Takashi Tashiro, Takanori Ueda, Taisuke Hori, Yu Hirate, Hayato Yamana
