WWW 2008 · ACM

A larger scale study of robots.txt

A website can regulate search engine crawler access to its content using the robots exclusion protocol, specified in its robots.txt file. The rules in the protocol enable the site to allow or disallow part or all of its content to certain crawlers, resulting in a favorable or unfavorable bias towards some of them. A 2007 survey of the robots.txt usage of 7,593 sites found some evidence of such biases, the news of which led to widespread discussion on the web. In this paper, we report on our survey of about 6 million sites. Our survey tries to correct the shortcomings of the previous survey and shows the lack of any significant preference towards any particular search engine.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Search Process
General Terms: Experimentation, Measurement
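
As a concrete illustration (not from the paper), the minimal sketch below uses Python's standard urllib.robotparser to show how per-crawler allow and disallow rules in a robots.txt file can favor one crawler over others; the user-agent names and rules are hypothetical examples.

    # Hypothetical robots.txt that favors one crawler over all others.
    from urllib.robotparser import RobotFileParser

    robots_txt = """
    User-agent: GoodBot
    Disallow:

    User-agent: *
    Disallow: /private/
    """

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # GoodBot may crawl everything; other crawlers are kept out of /private/.
    print(parser.can_fetch("GoodBot", "http://example.com/private/page.html"))   # True
    print(parser.can_fetch("OtherBot", "http://example.com/private/page.html"))  # False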
Type Conference
Year 2008
Where WWW
Authors Santanu Kolay