Last updated on: January 4, 2024 at 4:33 pm

Robots.txt: Why You Need To Care & How To Set It Up Properly


Robots.txt is a file website owners use to communicate with search engine crawlers and tell them which parts of their site they should and shouldn’t index.

It can also flag pages that are meant to stay private, although robots.txt alone does not stop users from accessing them; that requires authentication. This article explains what robots.txt is, why it matters for SEO, and how to set it up for your website or blog.

What is a robots.txt file?

Web admins create a text file called robots.txt to tell web robots (usually search engine crawlers) how to crawl their website. The robots.txt file is part of the robots exclusion protocol (REP), a set of web standards that govern how robots crawl the web, access and index content, and serve that content to users.

The REP also contains instructions on how search engines should approach links (such as “follow” or “nofollow”), as well as directives like meta robots.

Robots.txt files specify which user agents (web-crawling software) can crawl which website areas. The behavior of some (or all) user agents is “disallowed” or “allowed” according to these crawl instructions.

How does robots.txt work?

Search engines have two key responsibilities:

  1. Crawling the web to discover content.
  2. Indexing that content so it can be served to people searching for information.

Search engines crawl websites by following links from one site to another, spanning many billions of connections and domains. The term “spidering” is sometimes used to describe this crawling motion.

The search crawler will look for a robots.txt file after arriving at a website but before spidering it. If it finds one, the crawler will read that file before moving on to the rest of the site.

The information found in the robots.txt file will direct additional crawler action on this specific site because it contains instructions on how the search engine should crawl. The user-agent will continue to crawl other content on the site if the robots.txt file does not contain directives that forbid user-agent activity (or if the site does not have a robots.txt file).

The purpose of a robots.txt file

A robots.txt file is a text file that contains instructions telling bots which pages they can and cannot access. It is used to keep bots from crawling and indexing pages that are not meant to be seen by the public. A robots.txt file can also point crawlers to the XML sitemap(s) listing the pages on a website that are meant to be indexed by search engines.

Why is Robots.txt important?

A robots.txt file is not typically required for websites.

That’s because Google can typically identify and index your website’s key pages.

Additionally, Google will automatically omit duplicate or unimportant pages from its index.

However, there are three primary justifications for using a robots.txt file:

  • Block private pages: Occasionally, you may have pages on your site that you don’t want search engines to index. You might have a staging version of a page, for instance, or perhaps a login page. These pages need to exist, but you don’t want random visitors landing on them. In this situation, you would use robots.txt to keep search engine crawlers away from specific pages.
  • Increase crawl budget: If you’re having trouble getting all of your pages indexed, you may have a crawl budget issue. By blocking unimportant pages with robots.txt, you let Googlebot spend more of your crawl budget on the pages that genuinely matter.
  • Stop resources from being indexed: Meta directives can keep pages out of the index just as effectively as robots.txt. However, meta directives don’t work for multimedia resources like PDFs and images, since those files have no HTML in which to place a meta tag. In this situation, robots.txt is useful.
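As a rough sketch, a robots.txt file covering all three cases above might look like this. The paths are placeholders, and the *.pdf pattern relies on the wildcard support offered by Google and Bing:

```txt
User-agent: *
# Keep staging and login pages out of crawlers' reach
Disallow: /staging/
Disallow: /login/
# Keep PDF resources from being crawled (wildcard syntax)
Disallow: /*.pdf$
```

Remember that blocking a crawler is not access control: the staging and login pages here would still be reachable by anyone who knows the URL.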

The bottom line? Robots.txt instructs search engine spiders not to crawl particular pages on your website.

You can see how many pages are indexed in the Google Search Console.

You don’t need to worry about a robots.txt file if that number matches the number of pages you want indexed.

If that number is higher than you expected (and you notice indexed URLs that shouldn’t be crawled), it’s time to create a robots.txt file for your website.

Robots.txt vs. meta robots vs. x-robots: What’s the difference?

Numerous robots! What distinguishes these three kinds of robot instructions? First, robots.txt is an actual text file, whereas meta robots and x-robots are meta directives (a meta tag and an HTTP header, respectively). Beyond that, each of the three serves a distinct purpose: robots.txt specifies how a site or directory should be crawled, while meta robots and x-robots control how an individual page (or page element) should be indexed.

Technical robots.txt syntax to know

Robots.txt syntax can be considered the “language” of robots.txt files. There are five standard terms you’re likely to come across in a robot file.

Here is the TL;DR version:

  • User-agent: The specific web crawler to which you’re giving crawl instructions (usually a search engine). Most search engines publish lists of their user agents.
  • Disallow: The command that tells a user agent not to crawl a particular URL. Only one “Disallow:” line is allowed per URL.
  • Allow: The command that tells Googlebot it can access a page or subdirectory even though its parent page or subfolder is disallowed.
  • Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. Note that Googlebot ignores this command, although the crawl rate can be set in Google Search Console.
  • Sitemap: Used to indicate where any XML sitemap(s) associated with this URL can be found. This command is supported only by Google, Ask, Bing, and Yahoo.

Here is the longer version of the directives in the robots.txt file:

User-agent Directives

A robots.txt file is a set of directives that specify which content on a website can be accessed by which user agents. User-agent directives are written for specific bots, and each directive applies to the bot it is written for.

Google’s crawlers determine the correct group of rules by finding, in the robots.txt file, the group whose user agent most specifically matches the crawler’s own. If there are multiple groups of directives with different user agents, the crawler applies only the directives from the group it matches.
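As an illustration of that matching logic (the paths here are hypothetical), consider a file with a wildcard group and a Googlebot-specific group:

```txt
# All crawlers without a more specific group follow this rule
User-agent: *
Disallow: /archive/

# Googlebot matches this group, so it ignores the one above
User-agent: Googlebot
Disallow: /drafts/
```

Googlebot would obey only Disallow: /drafts/ and would still crawl /archive/, while every other crawler would follow the wildcard group.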

Sitemap Directive

A robots.txt file is a text file that tells search engine crawlers which URLs on your site they can access. This is used mainly to avoid overloading your site with too many requests. The sitemap directive is used to declare a link to the site map’s XML file(s). This directive is intended to notify search engines about the sitemap’s location so they can crawl it more efficiently.
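The directive itself is a single line, usually placed at the top or bottom of the file; the sitemap URL below is a placeholder:

```txt
Sitemap: https://www.example.com/sitemap.xml
```

The sitemap location must be an absolute URL, and the line can be repeated if the site has multiple sitemaps.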

Crawl-delay Directive

The “crawl-delay” directive in a robots.txt file helps manage web crawler activities by specifying their crawl rate. This is important to help prevent the overloading of the web server. The crawl-delay directive is an unofficial addition to the standard, and not many search engines support it. However, for those that do, it can be a helpful tool in managing web crawler traffic.
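For crawlers that honor it (Bingbot, for example; Googlebot does not), a crawl-delay rule might look like this sketch:

```txt
User-agent: Bingbot
# Commonly interpreted as a 10-second pause between requests
Crawl-delay: 10
```

Because the directive is unofficial, each search engine that supports it interprets the value in its own way, so check the relevant crawler’s documentation before relying on it.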

Allow and Disallow Directives

A robots.txt file is used to tell search engine crawlers which URLs they can access on your site. This is mainly to avoid overloading your site. The allow and disallow fields are also called rules (or directives). These rules are always specified in the form of rule: [path], where [path] is the directory or specific page you want to Allow or Disallow.

The “Allow” directive allows search engines to crawl a subdirectory or specific page, even in an otherwise disallowed directory. For example, if you have a robots.txt file that contains directives for search engines, you can use the Allow and Disallow directives together to tell them which pages they can crawl and which they can’t.

How are the allow and disallow directives implemented? Here are a few examples of how to use robots.txt Allow/Disallow:

Blocking all web crawlers from all content

User-agent: *
Disallow: /

This syntax in a robots.txt file would instruct all web crawlers not to access any pages on the site, including the homepage.

Allowing all web crawlers access to all content

User-agent: *
Disallow:

When used in a robots.txt file, this syntax tells web crawlers to crawl every page on the site, including the homepage.

Blocking a specific web crawler from a specific folder

User-agent: Googlebot
Disallow: /example-subfolder/

This syntax tells only Google’s crawler (user-agent name Googlebot) not to crawl any pages inside the /example-subfolder/ directory.

Blocking a specific web crawler from a specific web page

User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html

This syntax tells only Bing’s crawler (user-agent name Bingbot) to avoid crawling the specific page at /example-subfolder/blocked-page.html.


Robots.txt files can become quite complicated when it comes to which URLs to block or allow, because they support pattern matching to cover a range of possible URLs. Google and Bing both recognize two special characters that an SEO can use to identify pages or subfolders to exclude: the asterisk (*) and the dollar sign ($).

  • * is a wildcard that represents any sequence of characters.
  • $ matches the end of the URL.
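For example (with hypothetical paths), the two symbols can be combined like this:

```txt
User-agent: *
# Block any URL whose path contains a question mark (i.e., query strings)
Disallow: /*?
# Block only URLs that end in .pdf
Disallow: /*.pdf$
```

Without the trailing $, the second rule would also match URLs that merely contain “.pdf” somewhere in the path.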

Google provides an excellent overview of the possible pattern-matching syntax, with examples.

Here are a few useful robots.txt rules from Google.

Other quick robots.txt must-knows:

  • A robots.txt file needs to be placed in the website’s top-level directory to be found.
  • Robots.txt filenames are case-sensitive: the file must be named exactly “robots.txt” in lowercase (not Robots.txt, robots.TXT, or otherwise).
  • Your robots.txt file may be ignored by some user agents (robots). This is typical of the more malicious crawlers, such as email address scrapers and malware robots.
  • The /robots.txt file is publicly accessible: append /robots.txt to any root domain to view that website’s robots.txt file. This means anyone can see which pages you do or don’t want crawled, so don’t use the file to hide sensitive user information.
  • A parent domain’s subdomains each have their own robots.txt file, located at the root of each subdomain.
  • The location of any sitemaps linked to this domain should typically be noted at the bottom of the robots.txt file. Here’s an illustration:
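A minimal file with the sitemap listed at the bottom might look like this sketch (the domain and path are placeholders):

```txt
User-agent: *
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
```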

Where do I put robots.txt?

The robots.txt file must be located at the root of the website host to which it applies. A file placed anywhere else will not restrict crawling of that host’s URLs.

It cannot be placed in a subdirectory. Contact your web hosting company if you need help or have questions about accessing your website’s root. If you cannot access your website’s root, use a different blocking strategy, such as meta tags.

How to create a robots.txt file

A robots.txt file is a text file that tells search engine crawlers which URLs they can access on your site, mainly to avoid overloading the site with requests. You can create a new robots.txt file in any plain text editor. (Remember, only use a plain text editor; word processors can add formatting that corrupts the file.) If you already have a robots.txt file, download and edit that copy instead of creating a new one.

Creating a robots.txt file is a great way to control what search engine crawlers index on your site. Creating this valuable tool can improve crawling and even impact SEO.

Create robots.txt on your WordPress site with Rank Math

If you are using either Yoast or Rank Math, it has a robots.txt section. By default, this contains a rule disallowing /wp-admin/ for all user agents, but you will also want to add your sitemap to robots.txt.
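A typical WordPress robots.txt, with the default wp-admin rule plus the sitemap line added, might look something like this (the domain is a placeholder, and sitemap_index.xml is the filename these plugins commonly generate):

```txt
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.example.com/sitemap_index.xml
```

The Allow line keeps admin-ajax.php crawlable, since front-end features on many WordPress sites depend on it.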

Create robots.txt via FTP

  1. Create a file named robots.txt.
  2. Add rules to the robots.txt file.
  3. Upload the robots.txt file to your site. You can use Filezilla or cPanel from your hosting provider.
  4. Test the robots.txt file.

How to test your robots.txt file

Google’s robots.txt Tester shows your site’s current robots.txt file and lets you test URLs to see whether they’re disallowed for crawling.

To guide your way through complicated robots.txt files, Google created a tool to test and validate your robots.txt. Check if a URL is blocked and how. You can also check if the resources for the page are allowed or disallowed.

  1. Open the tester tool for your site; whether you have one site or multiple, select the property.
  2. Type in the URL of a page on your site in the text box at the bottom.
  3. Select the user-agent you want to simulate in the dropdown list to the right of the text box.
  4. Click the TEST button to test access.
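If you’d rather check rules locally before publishing, Python’s standard-library urllib.robotparser implements the robots exclusion protocol; the rules and URLs in this sketch are placeholders. (One caveat: Python applies the first matching rule, while Google prefers the most specific match, so list Allow lines before broader Disallow lines here.)

```python
# Sketch: validating robots.txt rules locally with Python's standard library.
from urllib.robotparser import RobotFileParser

# Placeholder rules: block /private/ except one public page.
rules = """\
User-agent: *
Allow: /private/public-page.html
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The Allow rule is listed first, so it wins for the one public page.
print(rp.can_fetch("*", "https://www.example.com/private/public-page.html"))  # True
print(rp.can_fetch("*", "https://www.example.com/private/secret.html"))       # False
print(rp.can_fetch("*", "https://www.example.com/index.html"))                # True
```

This is handy for regression-testing a robots.txt file in CI, although the final word on how Googlebot interprets your rules still belongs to Google’s own tester.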

SEO best practices

  • Verify that no content or areas of your website that you want to be crawled are being blocked.
  • Links on pages blocked by robots.txt won’t be followed. This means that (1) the linked resources won’t be crawled and may not be indexed unless they are also linked from other search-engine-accessible pages (i.e., pages not blocked by robots.txt, meta robots, or other techniques), and (2) a blocked page cannot pass any link equity to the pages it links to. If you have pages you want equity to flow to, use a blocking mechanism other than robots.txt.
  • Robots.txt should not be used to prevent private user information or other sensitive data from showing up in SERP results. The personal information page may still be indexed since other pages may link straight to it (bypassing the robots.txt directives on your root domain or homepage). Use a different technique, such as password protection or the noindex meta directive, to prevent your website from appearing in search results.
  • Some search engines use multiple user agents. For instance, Googlebot and Googlebot-Image are used for organic and image searches. Although it’s not necessary to provide directives for each of a search engine’s several crawlers because most user agents from the same search engine adhere to the same rules, having the option does let you fine-tune how your site’s content gets indexed.
  • A search engine will cache the contents of the robots.txt file, but the contents are typically updated at least once daily. You can submit your robots.txt URL to Google if you make changes to the file and want it updated more quickly than is currently happening.
  • Make it simple to find your robots.txt file. Once your file is ready, it’s time to publish it. Crawlers only look for robots.txt at the root of the host, so put the file at your root domain to ensure it gets found. (And remember that the filename is case-sensitive: use a lowercase “robots.txt”.)


A robots.txt file is an important tool for web admins who want to control how search engine crawlers (also called robots or spiders) crawl their website and its pages. This file helps keep crawlers out of areas of your site they shouldn’t visit and improves performance by keeping duplicate content out of the index.

If you don’t want your site crawled by search engine bots because it could result in an overload of traffic or simply because you don’t want them to see certain pages, then I hope this post helped you understand how to stop crawling by setting up a Robots.txt file.


Jay Kang

An entrepreneur and SEO expert, Jay is the driving force behind several innovative platforms. Committed to empowering marketers, he continues to make a positive impact in the digital marketing space.


© Copyright 2023. A product by SEO RANK SERP LLC.