
How to use a robots.txt file effectively?

Google wants to make robots.txt an official Internet standard. The file is used to set rules for search engine crawlers. It is important for SEO, and even minor misconfigurations can do more harm than good. Read on to learn more about its use, its syntax, and the best optimisation methods.

What is a robots.txt file? 

A robots.txt file contains directives for web crawlers. It is the very first file that bots check before they crawl your website, because it tells them what they are permitted to access. Directives in a robots.txt file are mainly used to prevent spiders from accessing certain pages and subfolders on the site.

How to find robots.txt?

There are a couple of ways to access the robots.txt file. The file sits in the root of your site, so you will most likely have to log in to your hosting account, open the File Manager in cPanel, and locate the file in your public_html folder. In WordPress, you can use the Yoast SEO plugin, which allows you to create and edit the robots.txt file directly from your CMS.

How to use robots.txt? 

You can write a variety of rules in your robots.txt file. The Disallow: directive is the most commonly used one, as it lets you block bots from accessing certain folders or pages on your website. There is also the hash symbol (#), which is used for comments.

See the simple structure of a robots.txt file below (the example comes from The Art of SEO: Mastering Search Engine Optimization, third edition, page 306).

Example of basic robots.txt file:

User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow: /

# Block all robots from tmp and logs directories
User-agent: *
Disallow: /tmp/
Disallow: /logs # for directories and files called logs 

The above example will do the following:

  • allow Googlebot to access the entire site
  • prevent Bingbot from accessing the site entirely
  • block all other robots from crawling the pages sitting in the /tmp/ folder and from accessing directories or files whose paths start with /logs

The behaviour of Googlebot is not affected by the directives set for all robots (the group that uses an asterisk), because a group addressed specifically to Googlebot exists. Check Google's overview of its crawlers (user agents) if you want to block particular Google bots from accessing certain pages on your website.

Below, you will find explanations of how to write rules for the robots.txt file, along with some examples.

Example 1: 'empty space', 'trailing slash', and 'hash' sign

If you leave the Disallow: value empty, you allow search engines to go anywhere on your website. A stand-alone slash, on the other hand, blocks bots from accessing your entire site. And the hash sign is great for comments (very commonly used in bigger sets of directives, as it lets you annotate the rules you have applied).
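To illustrate, here is a minimal sketch (the bot name ExampleBot is made up):

# ExampleBot may go anywhere: the Disallow value is left empty
User-agent: ExampleBot
Disallow:

# every other robot is blocked from the entire site by the lone slash
User-agent: *
Disallow: /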

Folders & subfolders

Example 2: /market

If you were to set a Disallow rule for /market on your site, you would block the bots from every URL whose path begins with /market. This covers the /market folder itself as well as paths where 'market' is the start of a longer word, such as 'marketing', because the path remains open (no trailing slash). In consequence, both folders /market/business-review/ and /marketing/design-ideas/ are blocked from crawling.
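In the file itself, the rule would look like this (written for all crawlers here; the blocked paths come from the example above):

User-agent: *
Disallow: /market
# blocked: /market/business-review/ (the path starts with /market)
# blocked: /marketing/design-ideas/ (the path also starts with /market)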

Example 2.1: /market*

In this example, a wildcard has been added, but it is ignored; the rule is equivalent to the previous one. As a rule, the asterisk wildcard character is used for pattern matching, but because the path was already open, a trailing asterisk does not change anything.
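Written out, the rule is simply:

User-agent: *
Disallow: /market*
# the trailing asterisk is redundant; the rule already matches any path starting with /market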

Example 2.2: /market/

The trailing slash has been added here (the path is now closed). Therefore, only pages that sit under the /market/ folder will be blocked from crawling. Let's reuse the previous example (Example 2) to emphasise the difference: pages in the folder /market/business-review/ will be blocked from crawling, but pages in the folder /marketing/design-ideas/ will remain available to search engine bots under this directive.
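The closed path looks like this in the file (using the same two folders for contrast):

User-agent: *
Disallow: /market/
# blocked: /market/business-review/ (sits inside the /market/ folder)
# still crawlable: /marketing/design-ideas/ (the path does not start with /market/)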

Example 3: /*.php

Here, a wildcard indicates that every URL that contains the .php extension will be blocked from crawling (pattern matching). Some examples are /filename.php, /folder/filename.php, /folder/filename.php?parameters, /filename.php/, etc.
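As a sketch, with the matching URLs from the list above noted in a comment:

User-agent: *
Disallow: /*.php
# blocked: /filename.php, /folder/filename.php, /folder/filename.php?parameters, /filename.php/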

Example 3.1: /*.php$

The dollar sign means that only URLs ending with the '.php' extension will be prevented from crawling. In the previous example (Example 3), the URLs did not have to end with '.php' to be disallowed; they just needed to contain it within the string. It is the dollar sign that makes the difference here, as it matches the end of the string.
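Side by side, the difference looks like this (the URLs are taken from Example 3):

User-agent: *
Disallow: /*.php$
# blocked: /filename.php, /folder/filename.php (the path ends with .php)
# still crawlable: /folder/filename.php?parameters, /filename.php/ (the path does not end with .php)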

Example 4: /fish*.php

The wildcard is used here to specify that every URL whose path begins with /fish and contains the '.php' extension further along the string will be blocked from crawling. Consequently, pages such as /fish.php, /fishheads/catfish.php?parameters, etc. will be disallowed from crawling.
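In the file, the rule would read:

User-agent: *
Disallow: /fish*.php
# blocked: /fish.php, /fishheads/catfish.php?parameters
# not blocked: /Fish.PHP (the directives are case sensitive, see the note below)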

Note: The directives are case sensitive. This means that disallowing the folder /marketing/ from crawling would still leave the folder /Marketing/ open to search engine bots. It is good to keep this in mind while writing the rules.

What does allow mean in robots.txt?

The Allow directive works the opposite way to Disallow. It is supported by Google and Bing. This rule is usually implemented to override a previous Disallow directive, which is especially useful when bigger sections of the site have been disallowed but you still want certain pages from those sections to be discovered and crawled by Google and other search engines.
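A minimal sketch, reusing the hypothetical /marketing/ folder from the earlier examples: the whole folder is disallowed, but one subfolder is explicitly allowed back in (for Google, the more specific rule takes precedence).

User-agent: *
Disallow: /marketing/
Allow: /marketing/design-ideas/
# everything else under /marketing/ stays blocked; /marketing/design-ideas/ can still be crawled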

Noindex directive, rel=”nofollow” attribute, XML sitemaps, and meta robots tag

The noindex directive tells search engines to exclude a page from search results. The page can still be crawled (unlike with the Disallow directive), but it will not show in the SERPs.

There has been a lot of confusion around the rel="nofollow" link attribute. Publishers used to assign it to most or all outbound links (links that point from your site to another domain) because they wanted to restrict the passing of link value between web pages (when you link to a domain other than yours, you signal to Google and other search engines that the destination is important, and consequently its authority rises). All in all, there is nothing wrong with linking to other websites without the rel="nofollow" attribute if you truly believe that the sources you are referencing are beneficial to your users; done in moderation, it causes no harm. The nofollow attribute should be used for paid links, though: Google's position is that freely given editorial links are the only links that should contribute towards bolstering a site's or page's rankings.

XML sitemaps (for your valid HTML pages, images, and videos) should ideally be referenced in the robots.txt file to speed up the discovery of all the important content on your website. Links to your valid XML sitemap(s) can be placed anywhere in the robots.txt file.
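The sitemap reference needs to be an absolute URL; for example (the file names below are assumptions):

Sitemap: https://www.yourdomain.com/sitemap.xml
Sitemap: https://www.yourdomain.com/sitemap-images.xml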

Let's now cover the robots meta tag. You can add one of the following combinations to the <head> section of your page:

<meta name="robots" content="noindex, follow">
<meta name="robots" content="noindex, nofollow">
<meta name="robots" content="nofollow">

If you need to control how certain pages on your website are indexed, or how the links on them are followed, you can choose any of the above options. However, you should avoid sending mixed signals to Google.

Case scenario for URL: www.yourdomain.com/marketing/design-ideas/

The page has the following set up:

  • <link rel="canonical" href="www.yourdomain.com/marketing/design-ideas/">
  • Disallow: /marketing/ (rule added to the robots.txt file)
  • <meta name="robots" content="noindex, follow">

What does the above say to Google?

We are sending mixed signals with the above setup. Firstly, we are telling Google (through the canonical tag) that www.yourdomain.com/marketing/design-ideas/ is the master copy of the page and should therefore be crawled and indexed. Next, we are preventing this page from being crawled at all with the robots.txt Disallow rule set for the /marketing/ folder; because the page cannot be fetched, Googlebot may never even see the canonical tag or the noindex directive. Finally, the meta robots tag tells Google that it is okay to crawl the page (the opposite of what the robots.txt file says) but not okay to index it. As you can see, there are a lot of contradictory rules here! It is important to keep these things in check when applying any of the above rules to your website.
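Assuming the goal is simply to keep this page out of the index, one consistent setup (a sketch, not the only possible fix) would be:

  • remove the Disallow: /marketing/ rule from the robots.txt file, so that the page can be fetched and the noindex tag actually seen
  • keep <meta name="robots" content="noindex, follow"> in the page's <head> section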

What pages could you potentially disallow from crawling and why?

One good example would be an e-commerce website that uses filters. Filtered pages are particularly useful for users because they allow them to view customised results. For search engines, however, filtered pages are near replicas of the actual category and product pages, hence they create duplicate content. Another example of pages that could be kept away from search engines is staging or test environment pages. In general, you can block any content that is either private or not valuable for SEO (search query parameter pages, pages with thin content, etc.).
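For example, a hypothetical set of rules for such pages (the query parameter names are assumptions and will differ from site to site):

User-agent: *
Disallow: /*?filter=
Disallow: /*?s=
# blocks any URL containing '?filter=' or '?s=', e.g. /shoes?filter=red or /?s=keyword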

Conclusion

As mentioned at the very beginning, the robots.txt file is vital for SEO because it enables you to set directives for search engines. Thanks to that, we can optimise our websites better and use the crawl budget effectively. I would highly recommend checking Google's documentation on how to create a robots.txt file, as well as validating your existing robots.txt file in a robots.txt validator and testing tool recommended by Google. To summarise, it is always good to make sure that the robots.txt file is used properly and is not blocking important pages from being crawled and indexed by Google and other search engines.
