Chapter 9. Robots Exclusion Protocol

The story of the Robots Exclusion Protocol (REP) begins with the introduction of robots.txt in 1993, prompted in part by a Perl web crawler that hogged the network bandwidth of a site whose owner would go on to create robots.txt (http://bit.ly/bRB3H).

In 1994, REP was formalized by the consensus of a “majority of robot authors” (http://robotstxt.org/orig.html). Originally, REP was meant only for resource exclusion; over time it has expanded to include directives for inclusion as well.
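The best-known inclusion-oriented addition is the Sitemap directive, which points crawlers at an XML Sitemap. A minimal robots.txt using it might look like the following sketch (the domain and sitemap URL are placeholders):

    # Allow all crawlers full access
    User-agent: *
    Disallow:

    # Inclusion directive: advertise the site's XML Sitemap
    Sitemap: http://www.example.com/sitemap.xml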

When we talk about REP today, we are talking about several things: robots.txt, XML Sitemaps, robots meta tags, the X-Robots-Tag HTTP header, and the nofollow link attribute. Understanding REP is important, as it is used for various SEO tasks. Managing duplicate content, hiding unwanted documents from search results, strategically distributing link juice, and removing documents from a search engine's index are just some of the tasks REP can assist with.
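As a quick illustration, here is roughly what each of these mechanisms looks like in practice. The domain, the /private/ path, and the link target are placeholders, not recommendations for any particular site:

    # robots.txt: keep all crawlers out of one directory
    User-agent: *
    Disallow: /private/

    <!-- robots meta tag, placed in an HTML page's <head> -->
    <meta name="robots" content="noindex, nofollow">

    # X-Robots-Tag, sent as an HTTP response header (useful for non-HTML files)
    X-Robots-Tag: noindex

    <!-- nofollow applied to an individual link -->
    <a href="http://www.example.com/" rel="nofollow">Example</a>

Each of these is covered in detail later in the chapter.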

Compliance with REP is voluntary, and not every search engine honors it. However, the big three search engines (Yahoo!, Google, and Bing) have adopted a strategy of working together to support REP in a nearly uniform way, while also collaborating to introduce new REP standards. The goal of these efforts is to provide consistent crawler behavior for the benefit of all webmasters.

This chapter covers REP in detail. Topics include robots.txt and its associated directives, HTML meta directives, and the .htaccess file.
