Using visual pages analysis for optimizing web archiving

15 years 5 months ago

Due to the growing importance of the World Wide Web, archiving it has become crucial for preserving useful sources of information. To keep a web archive up to date, crawlers harvest the web by iteratively downloading new versions of documents. However, crawlers frequently retrieve pages whose only changes are unimportant, such as continually updated advertisements. Hence, web archive systems waste time and space indexing and storing useless page versions. Querying the archive can also take longer because of the large set of useless page versions stored. Thus, an effective method is required to know accurately when and how often important changes between versions occur, in order to archive web pages efficiently. Our work addresses this requirement through a new web archiving approach that detects important changes between page versions. This approach consists of archiving the visual layout structure of a web page, represented by semantic blocks. This work seek...

Myriam Ben Saad, Stéphane Gançarski
