Robots.txt: Why You Need To Care & How To Set It Up Properly

Robots.txt is a file website owners use to communicate with search engine crawlers and tell them which parts of their site they should and shouldn’t crawl.

It can also ask crawlers to stay out of areas you would rather keep out of search results, although it does not stop users from opening those pages and is no substitute for authentication. This article gives you information about robots.txt, why we need it in SEO, and how to set it up for your website or blog.

What is a robots.txt file?

Web admins create a text file called robots.txt to tell web robots (usually search engine robots) how to crawl their website. The robots.txt file is part of the robots exclusion protocol (REP), a set of web standards that govern how robots crawl the web, access and index content, and serve that content to users.

The REP also includes directives like meta robots, as well as instructions on how search engines should treat links (such as “follow” or “nofollow”).

Robots.txt files specify which user agents (web-crawling software) can crawl which website areas. These crawl instructions “disallow” or “allow” the behavior of some (or all) user agents.

How does robots.txt work?

Search engines have two key responsibilities:

Crawling the web to discover content.
Indexing that content so that it can be served to searchers looking for information.

Search engines crawl websites by following links from one site to another, ultimately crawling across many billions of links and websites. This crawling behavior is sometimes called “spidering.”

After arriving at a website but before spidering it, the search crawler looks for a robots.txt file. If it finds one, the crawler reads that file before continuing through the site.

Because the robots.txt file contains instructions on how the search engine should crawl, the information found there directs further crawler action on this particular site. If the robots.txt file does not contain any directives that disallow a user agent’s activity (or if the site does not have a robots.txt file), the crawler proceeds to crawl the rest of the site.
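The check-before-crawl flow described above can be sketched in a few lines. This is only an illustration, assuming Python 3 and its standard-library urllib.robotparser module (not the parser any search engine actually uses); the robots.txt content, the ExampleBot name, and the URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text the way a crawler would before spidering a site."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def crawlable(parser: RobotFileParser, user_agent: str, urls: list) -> list:
    """Keep only the URLs this user agent is permitted to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]

queue = [
    "https://www.example.com/",
    "https://www.example.com/blog/post-1",
    "https://www.example.com/private/report",
]

# With the rules above, the /private/ URL is dropped from the crawl queue.
allowed = crawlable(build_parser(ROBOTS_TXT), "ExampleBot", queue)

# An empty file contains no disallow rules, so nothing is filtered out.
allowed_without_rules = crawlable(build_parser(""), "ExampleBot", queue)
```

A real crawler would download the file from the site first; parsing an inline string here just keeps the example self-contained.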

The purpose of a robots.txt file

A robots.txt file is a text file that contains instructions for bots on which pages they can and cannot access. It is used to keep bots from crawling pages that are not meant to appear in search results. A robots.txt file can also point crawlers to your XML sitemaps, which list the pages on a website that are meant to be indexed by search engines.

Why is Robots.txt important?

A robots.txt file is not typically required for websites.

That’s because Google can typically identify and index your website’s key pages.

Additionally, they will automatically omit duplicate or unimportant pages from their indexing.

However, there are three primary justifications for using a robots.txt file:

Block non-public pages: Occasionally, you may have pages on your site that you don’t want search engines to index. You might have a staging version of a page, for instance, or a login page. These pages need to exist, but you don’t want random visitors landing on them from search results. In this situation, you would use robots.txt to keep search engine crawlers and other bots away from specific pages.
Maximize crawl budget: If you’re having a hard time getting all of your pages indexed, you may have a crawl budget issue. By blocking unimportant pages with robots.txt, you let Googlebot spend more of your crawl budget on the pages that genuinely matter.
Stop resources from being indexed: Meta directives can work just as well as robots.txt for keeping pages out of the index. However, meta directives don’t work well for multimedia resources like PDFs and images. That’s where robots.txt comes in.

The bottom line? Robots.txt tells search engine spiders not to crawl specific pages on your website.

You can see how many pages are indexed in the Google Search Console.

You don’t need to worry about a robots.txt file if the number matches the number of pages you want indexed.

It’s time to make a robots.txt file for your website if that number is higher than you expected (and you find indexed URLs that shouldn’t be crawled).

Robots.txt vs. meta robots vs. x-robots: What’s the difference

So many robots! What distinguishes these three kinds of robot instructions? First, robots.txt is an actual text file, whereas meta robots and x-robots are meta directives (a tag in the page’s HTML and an HTTP response header, respectively). Beyond what they actually are, each of the three serves a distinct purpose. Robots.txt dictates crawl behavior for a whole site or directory, whereas meta robots and x-robots dictate indexation behavior for an individual page (or page element).

Robots.txt Use Cases

Robots.txt files play a crucial role in managing how search engines interact with your website. Let’s explore the main use cases in detail:

Protect private content: You might have pages on your site that you don’t want to appear in search results. For example, a staging environment, admin pages, or user account areas. Use robots.txt to tell search engines not to crawl these pages.

Example:

User-agent: *
Disallow: /admin/
Disallow: /user-accounts/

Manage crawl budget: Search engines allocate a certain amount of time and resources to crawl your site. This is your “crawl budget.” If you have a large site, you want search engines to focus on your most important pages. Use robots.txt to guide them.

Example:

User-agent: *
Disallow: /old-products/
Disallow: /archived-news/

This tells search engines to skip outdated sections, allowing more time for crawling current, relevant pages.

Prevent indexing of non-public resources: Your site might generate pages that you don’t want in search results, like internal search results or printer-friendly versions of pages. Block these with robots.txt to avoid duplicate content issues.

Example:

User-agent: *
Disallow: /search/
Disallow: /print-version/

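Control duplicate content: If the same content is reachable at several URLs (e.g., with session or tracking parameters appended), use robots.txt in combination with canonical tags to guide search engines to the preferred version. Note that robots.txt rules are paths, not full URLs, and each protocol and host (http vs. https, with or without www) is governed by its own robots.txt file, so host-level duplicates are better handled with redirects and canonical tags.

Example (the parameter names are placeholders):

User-agent: *
Disallow: /*?sessionid=
Disallow: /*?ref=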

Manage bandwidth usage: If your server can’t handle frequent crawling, use the crawl-delay directive to limit how often search engines access your site.

Example:

User-agent: *
Crawl-delay: 10

This asks crawlers that support the directive to wait 10 seconds between each page they crawl. Googlebot ignores crawl-delay.
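From the crawler’s side, honoring this directive might look like the sketch below. It assumes Python 3.6 or later, where the standard-library RobotFileParser exposes crawl_delay(); the polite_pause helper and the ExampleBot name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Same rules as the example above, supplied inline.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
"""

def polite_pause(parser: RobotFileParser, user_agent: str, default: float = 0.0) -> float:
    """Return the number of seconds to wait between requests for this agent."""
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return default if delay is None else float(delay)

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
pause = polite_pause(parser, "ExampleBot")

# A file without a Crawl-delay line falls back to the default pause.
no_delay = RobotFileParser()
no_delay.parse(["User-agent: *", "Disallow: /tmp/"])
fallback = polite_pause(no_delay, "ExampleBot")
```

A well-behaved crawler would then sleep for that many seconds between page requests.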

Remember, while robots.txt is powerful, it’s not a security measure. Malicious bots might ignore it, so don’t rely on it to protect sensitive information.

How to create a robots.txt file

A robots.txt file is a plain text file that tells search engine crawlers which URLs they can access on your site. You can create a new robots.txt file using your chosen plain text editor. (Remember, only use a plain text editor!) If you don’t already have a robots.txt file, you can create one by following the steps below; if you do, edit the existing file instead.

Creating a robots.txt file is a great way to control what search engine crawlers crawl on your site. Used well, it can improve crawling and even impact SEO.

Create robots.txt on your WordPress site with Rank Math

If you are using either Yoast or Rank Math, both plugins have a robots.txt section. By default, it tells all user agents to stay out of wp-admin. You will also want to add your sitemap to robots.txt with a Sitemap: line that gives the sitemap’s full URL.

Create robots.txt via FTP

Create a file named robots.txt.
Add rules to the robots.txt file.
Upload the robots.txt file to your site. You can use Filezilla or cPanel from your hosting provider.
Test the robots.txt file.
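The first two steps above can also be scripted rather than typed by hand. The following is a minimal sketch, assuming Python 3; the build_robots_txt helper, the /wp-admin/ rule, and the sitemap URL are illustrative choices, and the file is written to a temporary folder that stands in for wherever you stage files before uploading.

```python
import tempfile
from pathlib import Path

def build_robots_txt(groups, sitemaps=()):
    """Render (user_agent, [(directive, path), ...]) groups as robots.txt text."""
    lines = []
    for user_agent, rules in groups:
        lines.append(f"User-agent: {user_agent}")
        lines.extend(f"{directive}: {path}" for directive, path in rules)
        lines.append("")  # blank line separates groups
    lines.extend(f"Sitemap: {url}" for url in sitemaps)
    return "\n".join(lines).rstrip("\n") + "\n"

content = build_robots_txt(
    [("*", [("Disallow", "/wp-admin/")])],
    sitemaps=["https://www.example.com/sitemap.xml"],  # placeholder URL
)

# Step 1 and 2: create the file (all lowercase name) and add the rules.
out_file = Path(tempfile.mkdtemp()) / "robots.txt"
out_file.write_text(content, encoding="utf-8")
```

Uploading and testing (steps 3 and 4) still happen through your FTP client or hosting panel.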

Where do I put robots.txt?

The robots.txt file must be located at the root of the website host to which it applies. For example, to control crawling on all URLs below http://www.example.com/, the robots.txt file must be located at http://www.example.com/robots.txt.

It cannot be placed in a subdirectory (such as at http://example.com/pages/robots.txt). Contact your web hosting company if you need help or have questions about accessing your website’s root. If you cannot access your website’s root, use a different blocking method, such as meta tags.
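Because the location is fixed, the robots.txt address for any page can be derived from that page’s URL alone. Here is a small sketch using Python’s standard urllib.parse module; the robots_txt_url function name and the sample URLs are only illustrative.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

main_site = robots_txt_url("http://www.example.com/pages/about.html?ref=nav")
blog_site = robots_txt_url("https://blog.example.com/2024/some-post")
```

Note that the two results differ: a different host or protocol means a different robots.txt file.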

Technical robots.txt syntax to know

Robots.txt syntax can be considered the “language” of robots.txt files. There are five common terms you’re likely to come across in a robots.txt file.

Here is the TL;DR version:

User-agent: The specific web crawler to which you’re giving crawl instructions (usually a search engine).
Disallow: The command used to tell a user agent not to crawl a particular URL. Only one “Disallow:” line is allowed for each URL.
Allow: The command that tells a crawler it can access a page or subfolder even though its parent page or subfolder is disallowed. Googlebot supports it, as do most other major crawlers.
Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. Googlebot does not respond to this command; Google has offered separate crawl rate controls through Search Console instead.
Sitemap: Used to indicate where any XML sitemap(s) associated with this URL can be found. Google, Ask, Bing, and Yahoo support this command.

Here is the longer version of the directives in the robots.txt file:

User-agent Directives

A robots.txt file is a set of directives that specify which content on a website can be accessed by which user agents. Each group of rules begins with one or more User-agent lines, and the rules in that group apply only to the bots it names.

When a file contains several groups with different user agents, Google’s crawlers pick the group whose user agent most specifically matches their own and apply only the directives from that group.
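To make “most specific user agent” concrete, here is a simplified sketch of group selection. It assumes a crawler matches a group when the group’s name is a case-insensitive prefix of the crawler’s own token and that the longest such name wins; real crawlers may apply their own fallback lists, so treat the select_group function as an approximation rather than Google’s actual logic.

```python
def select_group(group_tokens, crawler_token):
    """Pick the group a crawler should obey: longest named match, else '*'."""
    crawler = crawler_token.lower()
    named_matches = [
        token for token in group_tokens
        if token != "*" and crawler.startswith(token.lower())
    ]
    if named_matches:
        return max(named_matches, key=len)  # most specific name wins
    return "*" if "*" in group_tokens else None  # wildcard group is the fallback

groups = ["*", "Googlebot", "Googlebot-News"]

news_group = select_group(groups, "Googlebot-News")    # has its own group
image_group = select_group(groups, "Googlebot-Image")  # falls back to Googlebot
other_group = select_group(groups, "Bingbot")          # only the wildcard matches
```

Only the selected group’s rules are applied; rules from the other groups are not merged in.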

Sitemap Directive

The Sitemap directive declares the location of the site’s XML sitemap file(s). It is intended to notify search engines about the sitemap’s location so they can find and crawl the listed URLs more efficiently.
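A crawler or audit script can read the declared sitemap locations back out of the file. This sketch assumes Python 3.8 or later, where the standard-library RobotFileParser provides site_maps(); the robots.txt content and sitemap URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one rule group plus a sitemap declaration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of declared URLs, or None when there are none.
sitemaps = parser.site_maps() or []
```

The Sitemap line is independent of the user-agent groups, so it is returned regardless of which crawler is asking.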

Crawl-delay Directive

The “crawl-delay” directive in a robots.txt file helps manage web crawler activities by specifying their crawl rate. This is important to help prevent the overloading of the web server. The crawl-delay directive is an unofficial addition to the standard, and not many search engines support it. However, for those that do, it can be a helpful tool in managing web crawler traffic.

Allow and Disallow Directives

The Allow and Disallow fields are also called rules (or directives). They are always written in the form rule: [path], where [path] is the directory or specific page you want to allow or disallow.

The “Allow” directive lets search engines crawl a subdirectory or specific page, even in an otherwise disallowed directory. Used together, Allow and Disallow tell crawlers which pages they can crawl and which they can’t.

How are the allow and disallow directives implemented? Here are a few examples of how to use robots.txt Allow/Disallow:

Blocking all web crawlers from all content

User-agent: *
Disallow: /

All web crawlers would be instructed not to access any pages on www.example.com, including the homepage, if a robots.txt file had this syntax.

Allowing all web crawlers access to all content

User-agent: *
Disallow:

This syntax instructs web spiders to crawl every page on www.example.com, including the home page, when it is used in a robots.txt file.

Blocking a specific web crawler from a specific folder

User-agent: Googlebot
Disallow: /example-subfolder/

Only Google’s crawler (user-agent name Googlebot) is instructed not to crawl any pages with the URL www.example.com/example-subfolder/ by using this syntax.

Blocking a specific web crawler from a specific web page

User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html

This syntax tells only Bing’s crawler (user-agent name Bingbot) to avoid crawling the specific page at www.example.com/example-subfolder/blocked-page.html.

Pattern-matching

Robots.txt files can become quite complicated when it comes to the actual URLs to block or allow because they support pattern-matching to cover a range of possible URL variations. Google and Bing both recognize two special characters (simple wildcards rather than full regular expressions) that an SEO can use to identify pages or subfolders to exclude. These two characters are the asterisk (*) and the dollar sign ($).

Any string of characters is represented by the wildcard character *.
$ matches at the end of the URL.
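One way to see how these two characters behave is to translate a robots.txt pattern into an ordinary regular expression. The sketch below does that in Python; the pattern_to_regex and matches helpers are illustrative names, and the sample patterns and paths are made up.

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")           # trailing $ pins the end of the URL
    body = pattern[:-1] if anchored else pattern
    # Each * becomes "any run of characters"; everything else is literal.
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))

def matches(pattern: str, path: str) -> bool:
    """True when the pattern applies to the given URL path (plus query string)."""
    return pattern_to_regex(pattern).match(path) is not None

ends_in_php = matches("/*.php$", "/index.php")           # ends exactly in .php
php_with_query = matches("/*.php$", "/index.php?x=1")    # $ rejects the query string
plain_prefix = matches("/private/", "/private/file.html")
session_param = matches("/*?sessionid=", "/shop/item?sessionid=42")
```

Without a trailing $, a pattern behaves as a prefix, which is why /private/ covers everything beneath that folder.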

The set of potential pattern-matching syntax and examples provided by Google is excellent.

Here are a few useful robots.txt rules from Google.

Other quick robots.txt must-knows:

A robots.txt file needs to be placed in the website’s top-level directory to be found.
Robots.txt is case-sensitive. The file must be named “robots.txt” (not Robots.txt, robots.TXT, or otherwise).
Your robots.txt file may be ignored by some user agents (robots). This is typical of the more malicious crawlers, such as email address scrapers and malware robots.
The /robots.txt file is publicly accessible; append /robots.txt to any root domain to view the robots.txt file for that website. This means that anyone can see which pages you do or don’t want crawled, so avoid using robots.txt to conceal sensitive user data.
Each subdomain of a parent domain uses its own robots.txt file. The robots.txt files for blog.example.com and example.com should be located at blog.example.com/robots.txt and example.com/robots.txt, respectively.
The location of any sitemaps linked to this domain should typically be noted at the bottom of the robots.txt file.

Grouping in Robots.txt

Grouping in robots.txt allows you to apply different rules to different search engine bots. This is useful when you want certain bots to access specific parts of your site while restricting others. Here’s a detailed look at how to use grouping effectively:

Structure of a group: Each group starts with a User-agent line, followed by the rules for that user agent. Groups are separated by blank lines.
User-agent specificity: You can specify rules for individual bots (like Googlebot or Bingbot) or use a wildcard (*) to apply rules to all bots.
Multiple user agents: You can apply the same rules to multiple user agents by listing them one after another before the rules.

Here’s an extended example to illustrate these points:

User-agent: Googlebot
User-agent: Googlebot-Image
Disallow: /private/
Allow: /images/

User-agent: Bingbot
Disallow: /admin/
Disallow: /beta/

User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/

In this example:

Google’s web and image crawlers are allowed to access the /images/ directory but not /private/.
Bing’s crawler is blocked from the /admin/ and /beta/ directories.
All other bots are blocked from /tmp/ and /cgi-bin/.
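The outcomes listed above can be checked with a short script. This is a rough sketch using Python’s standard-library urllib.robotparser, which is not the parser any search engine uses but handles simple grouped rules like these; OtherBot and the sample URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The grouped example from above, one directive per line.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Googlebot-Image
Disallow: /private/
Allow: /images/

User-agent: Bingbot
Disallow: /admin/
Disallow: /beta/

User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "https://www.example.com"
results = {
    # (user agent, path) -> may this agent fetch the path?
    (agent, path): parser.can_fetch(agent, BASE + path)
    for agent in ("Googlebot", "Googlebot-Image", "Bingbot", "OtherBot")
    for path in ("/private/a.html", "/images/logo.png", "/admin/", "/tmp/cache")
}
```

Notice that each bot is judged only by its own group: Bingbot is not affected by the /tmp/ rule in the wildcard group, and OtherBot is not affected by the /admin/ rule in Bingbot’s group.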

When using grouping, keep these tips in mind:

Be as specific as necessary. Use individual bot names when you need different rules for different search engines.
Use the wildcard group as a catch-all for bots not specifically named.
Keep your groups organized and commented for easy maintenance.

Proper use of grouping helps you fine-tune how different search engines interact with your site, potentially improving your SEO and server management.

Order of Precedence in Robots.txt

Understanding the order of precedence in robots.txt is crucial for effective implementation. Search engines follow specific rules when interpreting these files, and knowing these rules helps you avoid conflicts and achieve your desired outcome.

Here are the key principles of precedence in robots.txt:

Most specific user agent rules take priority:

If you have rules for both a specific bot and a wildcard (*), the specific bot’s rules will apply to that bot.

Example:

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /public/

Here, all bots except Googlebot are blocked from the entire site. Googlebot follows only its own group, which contains no Disallow rules, so it can crawl the whole site, including the /public/ directory.

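Within a group, the most specific matching rule applies, not the first one

Google and other crawlers that follow the current Robots Exclusion Protocol standard compare every matching rule in the group and use the one with the longest path, regardless of the order in which the rules are listed. Some older or simpler parsers stop at the first matching rule instead, so listing the more specific rule first keeps the result the same for both.

Example:

User-agent: *
Disallow: /private/
Allow: /private/public/

In this case, Google will crawl /private/public/ because the Allow rule is longer than the Disallow rule. A parser that applies the first matching rule would block it, which you can avoid by placing the Allow line above the Disallow line.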

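Allow directives override Disallow directives when both are equally specific

If a path matches an Allow and a Disallow pattern of the same length, Google applies the less restrictive rule, so the Allow takes precedence. When the lengths differ, the longer pattern wins.

Example:

User-agent: *
Disallow: /folder1/
Allow: /folder1/subfolder/

Here, /folder1/subfolder/ will be allowed, even though its parent folder is disallowed, because the Allow pattern is the longer match.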

More specific path rules override less specific ones

A rule with a longer path will override a shorter, conflicting path.

Example:

User-agent: *
Disallow: /page
Allow: /page/

In this case, /page will be disallowed, but /page/ (with a trailing slash) will be allowed.

Use of wildcards

Wildcards can be used in both Allow and Disallow directives. The most specific (longest) matching pattern takes precedence.

Example:

User-agent: *
Disallow: /*.php$
Allow: /index.php

This blocks all PHP files except for index.php.

Understanding these precedence rules allows you to create more sophisticated and precise robots.txt files. Always test your robots.txt file after making changes to ensure it behaves as expected.
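The longest-match behavior described in the path and wildcard rules above can be sketched as follows. This is a simplified model, assuming that specificity is measured by pattern length and that an Allow wins an exact tie; the is_allowed function and its rule format are invented for this illustration and are not any crawler’s actual code.

```python
import re

def _to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt pattern (* wildcard, optional trailing $) to a regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

def is_allowed(rules, path: str) -> bool:
    """rules: list of ("allow" | "disallow", pattern) pairs from one group."""
    best_length, allowed = -1, True  # no matching rule means the path is allowed
    for kind, pattern in rules:
        if not pattern or not _to_regex(pattern).match(path):
            continue  # empty or non-matching patterns are ignored
        is_allow = kind == "allow"
        # Longer pattern wins; on an exact tie the Allow rule wins.
        if len(pattern) > best_length or (len(pattern) == best_length and is_allow):
            best_length, allowed = len(pattern), is_allow
    return allowed

page_rules = [("disallow", "/page"), ("allow", "/page/")]
php_rules = [("disallow", "/*.php$"), ("allow", "/index.php")]

page_without_slash = is_allowed(page_rules, "/page")    # only Disallow matches
page_with_slash = is_allowed(page_rules, "/page/")      # longer Allow wins
index_php = is_allowed(php_rules, "/index.php")         # longer Allow wins
contact_php = is_allowed(php_rules, "/contact.php")     # only Disallow matches
```

Because the comparison looks at every rule before deciding, reordering the lines in either example does not change the outcome.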

How to test your robots.txt file

Once your file is live, test it to confirm which URLs are blocked from crawling. A testing tool shows the current robots.txt file and lets you enter URLs to see whether they’re disallowed for crawling.

To guide your way through complicated robots.txt files, Google created a tool to test and validate your robots.txt. Check if a URL is blocked and how. You can also check if the resources for the page are allowed or disallowed.

Open the tester tool for your site and select the property if you have more than one.
Type in the URL of a page on your site in the text box at the bottom.
Select the user-agent you want to simulate in the dropdown list to the right of the text box.
Click the TEST button to test access.
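Alongside a hosted tester, you can keep a small local check that compares a robots.txt file against the outcomes you expect. The sketch below assumes Python’s standard-library urllib.robotparser, which uses plain prefix matching, applies the first matching rule, and does not understand wildcards, so its answers can differ from Googlebot’s on complex files. The check_expectations helper, the ExampleBot name, and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def check_expectations(robots_txt: str, user_agent: str, expectations: dict) -> list:
    """Return (url, expected, actual) for every URL whose result is unexpected."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    failures = []
    for url, expected in expectations.items():
        actual = parser.can_fetch(user_agent, url)
        if actual != expected:
            failures.append((url, expected, actual))
    return failures

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search/
"""

expectations = {
    "https://www.example.com/": True,              # homepage must stay crawlable
    "https://www.example.com/admin/login": False,  # admin area must be blocked
    "https://www.example.com/search/?q=shoes": False,
}

failures = check_expectations(ROBOTS_TXT, "ExampleBot", expectations)
```

An empty failures list means the file behaves as intended for the URLs you listed; rerunning the check after every edit helps catch accidental blocking early.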

Common Robots.txt Issues

Even experienced webmasters encounter issues with robots.txt files. Understanding these common errors helps maintain an effective robots.txt file for your website. Let’s explore these problems in detail:

Accidental Blocking: Unintentionally blocking search engines from your main content severely impacts SEO. Regularly check your site’s crawl stats in search engine webmaster tools. A sudden drop in crawled pages requires an immediate review of your robots.txt file.

Case Sensitivity: Robots.txt is case-sensitive. “Robots.txt” or “ROBOTS.TXT” won’t work. Always use lowercase when naming your robots.txt file. Double-check the filename after uploading.

Incorrect File Location: Robots.txt belongs in your website’s root directory (e.g., https://www.example.com/robots.txt). Placing it elsewhere makes it ineffective. Verify its location using your browser.

Security Limitations: Robots.txt is a suggestion, not a security measure. Malicious bots may ignore it. Use proper authentication methods to protect sensitive information, and never include that information in robots.txt.

Outdated Instructions: As your website changes, so do your robots.txt needs. Failing to update the file leads to crawling issues. Review your robots.txt file regularly, especially after major site updates.

Syntax Errors: Typos or incorrect formatting cause issues. Use a robots.txt validator tool to check your file for syntax errors.

Blocking Resources: Preventing access to CSS, JavaScript, or image files harms your site’s rendering and indexing. Ensure your robots.txt file doesn’t block these resources. If necessary, be specific about what you block.

Wildcard Misuse: Misusing wildcards leads to unintended blocking or allowing of pages. Test your wildcard patterns thoroughly using tools like Google’s robots.txt Tester.

Subdomain Inconsistency: Each subdomain needs its own robots.txt file. Create and maintain separate files for each one if you use subdomains.

Crawl-Delay Limitations: Not all search engines support the crawl-delay directive. Manage crawl rate using specific settings in search engine webmaster tools, in addition to crawl-delay.

SEO best practices

Verify that no content or areas of your website that you want to be crawled are being blocked.
Links on pages that robots.txt has blocked won’t be followed. This means that:
1. The linked resources won’t be crawled and may not be indexed unless they are also linked from other search engine-accessible pages (i.e., pages not blocked by robots.txt, meta robots, or other techniques).
2. The blocked page cannot pass any link equity to the link destination. If you have pages that you want to pass equity, use a blocking mechanism other than robots.txt.
Robots.txt should not be used to prevent private user information or other sensitive data from showing up in SERP results. The page containing that information may still be indexed since other pages may link straight to it (bypassing the robots.txt directives on your root domain or homepage). Use a different technique, such as password protection or the noindex meta directive, to prevent the page from appearing in search results.
Some search engines use multiple user agents. For instance, Googlebot and Googlebot-Image are used for organic and image searches. Although it’s not necessary to provide directives for each of a search engine’s several crawlers because most user agents from the same search engine adhere to the same rules, having the option does let you fine-tune how your site’s content gets indexed.
A search engine will cache the contents of the robots.txt file, but the contents are typically updated at least once daily. You can submit your robots.txt URL to Google if you make changes to the file and want it updated more quickly than is currently happening.
Make it simple to find your robots.txt file. It’s time to publish your robots.txt file once you have it. Crawlers only look for the file at the root of your host, so place it at https://example.com/robots.txt; a copy in any other directory will not be found.
(Remember that the robots.txt filename is case-sensitive, so make sure it uses a lowercase “r”.)

Conclusion

A robots.txt file is an important tool for web admins who want to control how search engine crawlers (also called robots or spiders) crawl their website and its pages. This file helps keep crawlers out of areas of your site you don’t want crawled and can reduce the amount of duplicate content they crawl.

If you don’t want your site crawled by search engine bots because it could result in an overload of traffic or simply because you don’t want them to see certain pages, then I hope this post helped you understand how to stop crawling by setting up a Robots.txt file.


About the Author: Jay - Linkilo
