Crawling the web for structured documents

Structured Information Retrieval has gained a lot of interest in recent years, as this kind of information is becoming an invaluable asset for professional communities such as Software Engineering. Most of the research has focused on XML documents, with initiatives like INEX bringing together and evaluating new techniques focused on structured information. Although XML documents are the immediate choice, the Web is filled with several other types of structured information, which account for millions of other documents. These documents may be collected directly using standard Web search engines like Google and Yahoo, or by following specific search patterns in online repositories like SourceForge. This demo describes a distributed and focused web crawler for any kind of structured documents, and we use it to show how to exploit general-purpose resources to gather large amounts of real-world structured documents off the Web. This kind of tool could help build large test collections...

Julián Urbano, Juan Loréns, Yorgos A

CIKM 2010 | Documents | Information Technology | Structured | XML Documents |

» Detecting near-duplicates for web crawling

» Random web crawls

» UCYMICRA: Distributed Indexing of the Web Using Migrating Crawlers

» Evaluation Methods for Focused Crawling

» Focused Crawling Using Context Graphs

» Intelligent crawling on the World Wide Web with arbitrary predicates

» Distributed Indexing of the Web Using Migrating Crawlers

» Ontology-Focused Crawling of Web Documents

Post Info

Added	21 Mar 2011
Updated	21 Mar 2011
Type	Journal
Year	2010
Where	CIKM
Authors	Julián Urbano, Juan Loréns, Yorgos Andreadakis, Mónica Marrero
