Stop & Think: Robots.txt might not be the future

by Matt Morgan on Thursday 7 March 2013

The Robots Exclusion Protocol (REP) is a web standards convention that regulates search engine crawling, keeping cooperating web robots away from the specific pages of a website that the webmaster wishes to exclude from the SERPs.

In this article we will assess the use of robots.txt within ecommerce websites as a method of preventing pages from being indexed by search engines, comparing it to the gains that can be made by using the rel="canonical" tag alongside meta robots tags in its place. As far as search engine indexing is concerned, the REP is complemented by the XML sitemap protocol, which lets a webmaster tell a search engine which pages are available for crawling. This allows the search engine to crawl the site quickly and intelligently, helping to surface the pages you want to rank in the results pages.
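As a rough sketch of that protocol, a sitemap is simply an XML file listing the URLs you want crawled (the domain and paths below are illustrative):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- One <url> entry per page you want search engines to crawl -->
      <url>
        <loc>https://www.example.com/dresses</loc>
        <lastmod>2013-03-01</lastmod>
        <changefreq>daily</changefreq>
      </url>
      <url>
        <loc>https://www.example.com/shoes</loc>
      </url>
    </urlset>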


Robots.txt

A robots.txt file is a text file that webmasters create to instruct robots how to crawl and index pages on their website. Typically, webmasters use the REP to disallow certain URLs in an attempt to block web crawlers from indexing specific content. Whilst there is no doubting the ease with which a robots.txt file can be implemented, there are some misconceptions about how effective it actually is.
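As a quick sketch (the paths here are illustrative), a robots.txt file sits at the root of the domain and pairs a user-agent declaration with one or more directives:

    # Rules for all crawlers
    User-agent: *
    # Keep crawlers out of these directories
    Disallow: /checkout/
    Disallow: /search/
    # Point crawlers at the XML sitemap
    Sitemap: https://www.example.com/sitemap.xml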

One common misconception is that the robots.txt file will prevent a 'disallowed' page from being indexed. Next.co.uk, for example, uses its robots.txt file to block web crawlers from the maternity directory:
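The directive takes roughly this form (a reconstruction for illustration, not a copy of Next's actual file):

    User-agent: *
    Disallow: /maternity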

However, the maternity directory still appears indexed in Google, typically listed as a bare URL with no snippet, because the blocked page cannot be crawled for a description.

Even when a robots.txt file is used, pages can still be added to supplementary indexes, as Google will want to return relevant results, disallowed or not. Moreover, if another website or blog links to the URL, any link juice is wasted: the page is blocked from being crawled by the robots.txt file, yet it remains indexed. Robots.txt files are not embedded within a site's code like meta robots tags, and because of this their directives are requests rather than commands, so robots can ignore your REP. This leaves you open to malware robots and email address harvesters.


Canonical Tag

The canonical tag, especially for ecommerce SEO, enables you to get around the problem of duplicate content appearing on different URL landing pages. Ecommerce sites are particularly prone to this: many visitors filter their searches using URL parameters rather than clicking through, link by link, to exactly what they are looking for, and each filtered URL can serve the same content, creating duplicates.

Take a clothing retailer, for example. Many people will search for a specific group of items, or even a single item, filtering their chosen items by size, price, colour, etc. This means there will be numerous pages on the site that you will not want indexed, but that are still necessary to give the user a streamlined online shopping experience.
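To illustrate with made-up URLs, all three of the following could return exactly the same set of products, handing search engines three addresses for one page of content:

    https://www.example.com/dresses?colour=red&size=12
    https://www.example.com/dresses?size=12&colour=red
    https://www.example.com/dresses?size=12&colour=red&sort=newest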

The canonical tag allows link equity to be passed through pages that you don't want indexed but that may still have been linked to by bloggers or other sites. Google will then display the 'canonical' page in the SERP, meaning that your target pages rank higher for your target keywords against competitors, and that the site is not competing against itself to rank for those keywords, as it would if duplicate content appeared as near-identical search results.

An example of this can be seen in the 'New In' section of a site, which features new items that will also have different URLs wherever else they are listed. Below are three identical pages (duplicate content as far as Google is concerned) with different URLs: the first is the 'click through' canonical page, the second is from the 'New In' section, and the third is the same page reached through Asos' internal search. The pages are identical, but thanks to the canonical tag, instead of Google seeing duplicate content and link juice being dispersed away from the page you want to rank, any accrued link equity is passed to the canonical page, benefiting not only that page but the site as a whole.
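The three URLs would follow a pattern along these lines (illustrative paths, not Asos' actual URL structure):

    https://www.example.com/example-dress/prod/123           <- canonical 'click through' page
    https://www.example.com/new-in/example-dress/prod/123    <- 'New In' listing
    https://www.example.com/search/example-dress/prod/123    <- internal search result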

Page source showing that these duplicate URLs should be canonicalised to the canonical URL:
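On each duplicate page, this is a single link element in the head; a sketch using the illustrative URLs above:

    <head>
      <!-- Tells search engines which URL is the preferred version of this page -->
      <link rel="canonical" href="https://www.example.com/example-dress/prod/123" />
    </head>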

Meta Robots Tags

As has been established, there is little reason for webmasters of ecommerce sites to rely on robots.txt to block robot access to specific pages they don't want a search engine to crawl. At best it is a guide that can be ignored, and at worst any link equity from disallowed pages is lost, whilst external websites linking back to those pages cause Google to index them regardless.

The robots meta tag acts as a command and is built into the site architecture, giving the search engine no choice but to adhere to it, whilst also being far more flexible and rewarding than its rigid robots.txt counterpart. There are three variants of the robots meta tag, ensuring you still receive any link equity accrued from external links pointing at pages you don't want indexed. Thus, instead of simply blocking pages as in the Next.co.uk example, your website can benefit from any link juice these pages may gain, without being harmed by duplicate content, much as with the canonical tag. As with any meta tag, it should be placed in the HEAD section of the markup, as follows:
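A minimal sketch of the placement (the page title is invented for illustration):

    <head>
      <title>Example category page</title>
      <!-- Keep this page out of the index, but follow the links on it -->
      <meta name="robots" content="noindex, follow" />
    </head>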

Using the robots meta "noindex, follow" tag allows you to collect any link juice from pages that you don't want indexed, telling robots to follow all the links on the page without indexing it, and removing the need to ever resort to robots.txt files.

As Google is still able to crawl the page, the meta REP tag instructs it to follow the internal links on the page, therefore distributing any link equity the page has accrued to different areas of the site. Through the meta robots tag you are also able to have pages indexed without any of their links being followed ("index, nofollow"), as well as to completely exclude crawling, indexing, or any link equity gain for pages carrying the "noindex, nofollow" tag.
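Side by side, the three variants look like this; each page would carry just one of them in its head:

    <!-- Pass equity through the page's links, but keep the page out of the index -->
    <meta name="robots" content="noindex, follow" />

    <!-- Index the page, but do not follow or pass equity through its links -->
    <meta name="robots" content="index, nofollow" />

    <!-- Neither index the page nor follow its links -->
    <meta name="robots" content="noindex, nofollow" />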

Consequently, there is a strong argument that, through the use of the canonical tag combined with meta robots tags, it is never necessary to use robots.txt to filter out pages you don't want indexed from the SERPs.