A web page may be relevant to multiple topics; even when nominally on a single topic, the page may attract attention (and thus links) from multiple communities. Instead of indiscr...
We present a highly accurate method for classifying web pages based on link percentage, which is the percentage of text characters that are parts of links normalized by the number...
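As a rough illustration of the link-percentage idea, the sketch below computes the share of a page's text characters that fall inside `<a>` elements. The normalization step mentioned in the abstract is cut off in this listing, so it is deliberately left out. The names `LinkTextCounter` and `link_percentage` are hypothetical, and ignoring whitespace is an assumption of this sketch, not something the abstract states.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Counts text characters and how many of them sit inside <a> elements.

    Assumption of this sketch: whitespace is ignored, and <script>/<style>
    contents are not filtered out (a real classifier would likely skip them).
    """

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # greater than 0 while inside an <a> element
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Count only non-whitespace characters so indentation does not skew the ratio.
        n = sum(1 for ch in data if not ch.isspace())
        self.total_chars += n
        if self.anchor_depth > 0:
            self.link_chars += n


def link_percentage(html: str) -> float:
    """Return the percentage of text characters that are part of links.

    The further normalization described in the abstract is not applied here,
    because its definition is truncated in the source.
    """
    parser = LinkTextCounter()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return 100.0 * parser.link_chars / parser.total_chars
```

For example, `link_percentage('<p>abcd <a href="#">efgh</a></p>')` counts four characters outside the link and four inside it. A navigation or hub page would be expected to score much higher on this measure than an article page, which is the intuition a classifier could build on.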
Consider a rooted tree T of arbitrary maximum degree d representing a collection of n web pages connected via a set of links, all reachable from a source home page represented by ...
The widespread use of templates on the Web is considered harmful for two main reasons. Not only do they compromise the relevance judgment of many web IR and web mining methods suc...
Karane Vieira, Altigran Soares da Silva, Nick Pint...
For many companies and institutions, it is no longer sufficient to have a web site and high-quality products or services. What in many cases makes the difference between success...
This paper provides an explanation of the basic data structures used in a new page analysis technique to create wrappers (data extractors) for the result pages produced by web sit...
Institutions and companies that are based in countries where the main language is not English typically publish Web sites that offer the same information at least in the local lan...
Filippo Ricca, Paolo Tonella, Emanuele Pianta, Chr...
The collective contributions of billions of users across the globe each day result in an ever-changing web. In verticals like news and real-time search, recency is an obvious sign...
Search trails comprising queries and Web page views are created as searchers engage in information-seeking activity online. During known-item search (where the objective may be to...
Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effecti...