Sciweavers

JCDL
2005
ACM

What's there and what's not?: focused crawling for missing documents in digital libraries

Some large-scale topical digital libraries, such as CiteSeer, harvest online academic documents by crawling open-access archives and university and author homepages, and by accepting authors’ self-submissions. While these approaches have so far built reasonably sized libraries, they can end up holding only a portion of the documents from specific publishing venues. We propose to use alternative online resources and techniques that maximally exploit other resources to build the complete document collection of any given publication venue. We investigate the feasibility of using publication metadata to guide the crawler towards authors’ homepages to harvest what is missing from a digital library collection. We collect a real-world dataset from two Computer Science publishing venues, involving a total of 593 unique authors over the period 1998 to 2004. We then identify the papers that are missing from CiteSeer's index. Using a fully automatic heuristic-based system that has the capability o...
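As a rough illustration of the idea in the abstract (compare a venue's publication metadata against what the library already indexes, then group the gaps by author so a crawler knows whose homepage to look for), here is a minimal Python sketch. It is not the authors' system: the record layout, the title-normalization rule, and the names `normalize_title`, `find_missing`, and `seeds_by_author` are all assumptions made for this example, and the abstract does not say how homepages are located or how papers are matched.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Lowercase and collapse punctuation/whitespace (an assumed matching rule)."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def find_missing(venue_records, indexed_titles):
    """Return venue records whose normalized title is not in the library index."""
    indexed = {normalize_title(t) for t in indexed_titles}
    return [r for r in venue_records if normalize_title(r["title"]) not in indexed]


def seeds_by_author(missing_records):
    """Group missing paper titles by author, as hypothetical crawl seeds."""
    seeds = defaultdict(list)
    for rec in missing_records:
        for author in rec["authors"]:
            seeds[author].append(rec["title"])
    return dict(seeds)


# Hypothetical venue metadata and library index, for illustration only.
venue_records = [
    {"title": "Paper A: A Study", "authors": ["Alice Example", "Bob Example"]},
    {"title": "Paper B", "authors": ["Alice Example"]},
]
indexed_titles = ["paper a - a study"]

missing = find_missing(venue_records, indexed_titles)
seeds = seeds_by_author(missing)
```

In this toy input only "Paper B" is absent from the index, so the only seed is its author. A real crawler would still need a step that turns each author name into a candidate homepage URL, which this sketch deliberately leaves out.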
Ziming Zhuang, Rohit Wagle, C. Lee Giles
Added: 26 Jun 2010
Updated: 26 Jun 2010
Type: Conference
Year: 2005
Where: JCDL
Authors: Ziming Zhuang, Rohit Wagle, C. Lee Giles