This paper describes the current state of RUgle, a system for classifying and indexing papers made available on the World Wide Web, in a domain-independent and universal manner. B...
Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive...
Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay ...
Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are being replicated many times....
One of the key components of current Web search engines is the document collector. This paper describes CoBWeb, an automatic document collector, whose architecture is distributed ...
Altigran Soares da Silva, Eveline A. Veloso, Paulo...
Many Web services operate their own Web crawlers to discover data of interest, despite the fact that largescale, timely crawling is complex, operationally intensive, and expensive...
Jonathan M. Hsieh, Steven D. Gribble, Henry M. Lev...