What is robots.txt in SEO? Understanding what you need to know for your website

Stumped by robots.txt files on your website, why they’re important and what you need to do about them?

We know technical SEO can get a little tricky, but don’t worry.

We’ve put together a guide so that you can get to grips with robots.txt files.

What is a robots.txt file?

A robots.txt file is a crawl directive. So, it’s a way of telling Google, or any other search engine, which parts of your website you want it to be able to access.

What does a robots.txt file look like?

The file is a plain-text file that sits on your website and uses a simple syntax to tell search engines the rules you’ve set, which is why it’s critical that you get it right. If you get this syntax wrong, you could seriously damage your website’s performance in search. It will look something like this:

[Image: example of a robots.txt file]
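If the screenshot isn’t loading for you, here’s a minimal, illustrative sketch of the kind of thing you’d see (the ‘/admin/’ folder is just a made-up example):

# Apply the rules below to every crawler
User-agent: *
# Don't let anything under /admin/ be crawled (example folder only)
Disallow: /admin/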

So, what are the rules? How does robots.txt work?

You can use robots.txt to tell each search engine which parts of your site you want it to access. Here is a closer look at each robots.txt example, including the robots.txt crawl delay and disallow rules.


Robots.txt examples

The rule: Disallow the full website from being crawled
What it means: No search engine will be able to get to your website.
Robots.txt example:
User-agent: *
Disallow: /

The rule: Disallow the crawl of the calendar and the content on it
What it means: No search engine will be able to access that particular part of your site. You can do this with any page you might have secure information on.
Robots.txt example:
User-agent: *
Disallow: /calendar/

The rule: Disallow the crawl of a particular page
What it means: No search engine will be able to access that exact page.
Robots.txt example:
User-agent: *
Disallow: /webpagename/

The rule: Disallow a particular search engine from crawling a specific page
What it means: Only that search engine won’t be able to access the page. In this instance, Bing and others will be able to, but Google won’t.
Robots.txt example:
User-agent: Googlebot
Disallow: /webpagename/

The rule: Delay a particular search engine crawling the page/site for a given number of seconds
What it means: The search engine will wait that number of seconds between requests when crawling the pages you’ve set the limit on.
Robots.txt example:
Crawl-delay: 10


Understanding robots.txt crawl delay

Googlebot ignores the Crawl-delay directive in robots.txt, so for Google you set a crawl rate limit in Google Search Console instead: go to your Site Settings and you’ll see the below.

[Image: example of the crawl rate setting in Google Search Console]

This should be used as a temporary fix when something isn’t quite right on your site, so that a given search engine’s crawling doesn’t overwhelm it while you sort things out.
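Other search engines, such as Bing, do read Crawl-delay directly from robots.txt. As a rough sketch (the 10-second value is just an example), the directive sits under a user-agent line like this:

# Ask Bing's crawler to wait 10 seconds between requests (example value)
User-agent: Bingbot
Crawl-delay: 10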

How to find the robots.txt file on your website

Your site should only ever have one robots.txt file, and you can add all of your rules to that file. To find it, go to your website’s domain followed by “/robots.txt”. So that would look like: https://www.example.co.uk/robots.txt
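Because everything lives in that single file, the separate rules from the table above would simply be stacked together. As an illustrative sketch (the folder and page names are taken from the earlier examples, not from a real site):

# Rules for all search engines
User-agent: *
Disallow: /calendar/

# Extra rule for Google only
User-agent: Googlebot
Disallow: /webpagename/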

If that URL returns a 404 error page, you need to fix this by creating a robots.txt file.

How to create robots.txt file for SEO

In WordPress, with an SEO plugin such as Yoast installed, go to ‘SEO’ in the left-hand pane and then click ‘Tools’.

You should be able to access the ‘File editor’, then click ‘Create robots.txt file’, which will take you to your file, where you can add the rules you want to apply.

Alternatively, you can open a new Notepad file, create your robots.txt file in there and then add it to your CMS.

As in the WordPress example above, some content management systems give you a helpful plugin to do this.
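If you’re starting from a blank Notepad file, the default rules WordPress generates for itself are a sensible reference point. They usually look something like this (check your own plugin’s output before copying it):

User-agent: *
# Keep crawlers out of the WordPress admin area...
Disallow: /wp-admin/
# ...but still allow the AJAX endpoint that themes and plugins rely on
Allow: /wp-admin/admin-ajax.php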

How to check which pages Google is and isn’t crawling

Use Google’s robots.txt tester tool by submitting a URL to it. Remember, this will only show you whether Google can crawl the pages you do and don’t want it to. If your robots.txt file has disallow rules aimed at other search engines, such as Bing, this tool won’t test those.
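For example, a rule like the sketch below only applies to Bingbot, so Google’s tester would report the page as crawlable even though Bing is blocked (the ‘/private-offers/’ path is purely hypothetical):

# Block Bing's crawler from one section (hypothetical path)
User-agent: Bingbot
Disallow: /private-offers/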

Why use robots.txt files on your website?

Google has what’s called a ‘crawl budget’. That means it effectively has a limit on how many pages it will crawl on your site. It has two parts:

1 – Crawl rate limit: Google will begin to crawl the pages on your website and then, depending on your site’s crawl health, adjust its crawl rate.

For example, if your site responds very slowly or has errors, Googlebot could end up crawling fewer pages.

You can set limits in GSC too, but remember that a higher crawl rate won’t necessarily mean more crawling.

2 – Crawl demand: Google will crawl more popular URLs to make sure new or updated content is crawled and indexed.

What affects Google’s crawl budget?

You want to make sure that the pages Google crawls are pages that are worth crawling. So, if it crawls any of the below, it’s wasting its time, and will thus affect the crawl budget.

  • Duplicated content
  • Soft 404 error pages
  • Poor quality pages
  • Faceted URLs, e.g. those filtered by price or colour (see the example after this list)
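For faceted URLs, one common approach (sketched below with made-up parameter names) is to use wildcard patterns so the filtered versions of a page aren’t crawled; Google and Bing both support the ‘*’ wildcard in robots.txt rules:

User-agent: *
# Block filtered/faceted versions of category pages (example parameters)
Disallow: /*?price=
Disallow: /*?colour=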

The bottom line is that you use robots.txt to help Google crawl your site effectively.

Frequently asked questions about robots.txt files

Why is Google saying my page is indexed, though blocked by robots.txt?

If your page is blocked by a robots.txt file, it can still be indexed and rank if it’s being linked to elsewhere. But because Google can’t crawl it, the listing will look a little odd, usually appearing without a description.

Instead, if you do want your page to rank, best practice is to allow Google to crawl it and fully optimise the page for SEO.

If you don’t want your page to be indexed or ranked, use a ‘noindex’ tag instead, and bear in mind that Google needs to be able to crawl the page to see that tag.

Which pages should I block with robots.txt?

  • Thank you pages – you won’t want these to appear in the listings, so add a ‘noindex’ as well
  • The one exception for duplicate content – if you’ve got pages ‘A’ and ‘B’, and ‘B’ is a printer-friendly version of ‘A’, add a robots.txt rule to disallow Google from accessing page ‘B’ so it doesn’t waste its crawl budget (see the example after this list)
  • Admin pages – it’s worth adding a ‘noindex’ too because you don’t want these to appear in the SERPs
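Pulled together, the rules for those pages might look something like the sketch below. The paths are placeholders, so swap in the URLs your own site actually uses:

User-agent: *
# Thank-you pages (also carry a 'noindex' tag)
Disallow: /thank-you/
# Printer-friendly duplicates of existing pages
Disallow: /print/
# Admin pages (also carry a 'noindex' tag)
Disallow: /admin/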

Can a user still land on my page if it’s blocked by robots.txt?

Yes. If it’s linked to from any other place on your website or an external source, anyone can still access that page.

Creating your robots.txt file? Top 5 brutal mistakes to steer clear of

  • Don’t put the file name in uppercase – it only works as lowercase ‘robots.txt’, and remember that the paths in your rules are case-sensitive too
  • Don’t disallow a page and still link to it on site – Google can still find and index the URL
  • Don’t use ‘user-agent: Googlebot’ followed by ‘disallow: /’ because this will block your whole site from being crawled by Google
  • Don’t put your file anywhere else other than in the main directory of your website – do ‘www.yourwebsite.com/robots.txt’ instead of ‘www.yourwebsite.com/products/robots.txt’
  • Don’t leave the user-agent line empty – use ‘*’ if you want the rule to apply to all search engines

Remember, if you don’t want Google to get access to a given page, add a disallow to your robots.txt file.

And if you don’t want it to rank, add a ‘noindex’ too, because it could still be linked to from anywhere on your site or elsewhere.

Those are the basics on what a robots.txt file does, why you should use it, and how you can go about implementing it on your own website.

If you’d like to understand more about how the different aspects of SEO work, you can visit our digital marketing blog.

And make sure you’re in the loop too on our Facebook, Instagram and Twitter platforms!

Written by Katie McDonald in Digital Marketing