You might see this message in Google Search Console: “Indexed, though blocked by robots.txt.” It means that Google has indexed a URL even though your robots.txt file blocks it from being crawled.

If you’ve seen the Google Search Console message that reads “Indexed, though blocked by robots.txt,” you’ll want to address it as quickly as possible, because it can affect your site’s ability to rank in search engine results pages (SERPs).

What is a robots.txt file?

Your robots.txt file is a text file that tells robots (search engine crawlers) which pages on your site they may look at and which they may not. You do this by “allowing” or “disallowing” the actions of web crawlers. Many people assume that disallowing a page means robots can’t crawl it and therefore can’t index it, but that isn’t always true.
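For illustration, here is a minimal robots.txt sketch; the /wp-admin/ directory is just an example of a path you might not want crawled:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Here, the Disallow line asks every crawler to stay out of /wp-admin/, while the Allow line carves out one file they may still fetch.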


What does “Indexed, though blocked by robots.txt” mean?

The message in Google Search Console reads “Indexed, though blocked by robots.txt.” It indicates that Google indexed a URL even though your robots.txt file was set to block it. Because Google isn’t sure whether you actually want these URLs indexed, it marks them as “Valid with warning.” You’ll find the solution to this problem in this post.

How to locate “Indexed, though blocked by robots.txt” issues

The warning means that your page has been indexed, but your robots.txt file tells Google not to crawl it. Because Googlebot can’t read the page, it may appear in search results with little or no description. Often the block was added on purpose or by accident, and it can be fixed.

You can go to Google Search Console > Coverage. The affected URLs are listed under “Indexed, though blocked by robots.txt” in the “Valid with warning” section.

Steps to fix “Indexed, though blocked by robots.txt”

Here are some steps you can take if you encounter this warning:

Find out the problem behind the notification

The notification can appear for several reasons, and a page blocked by robots.txt is not always a bad thing. The block may have been added deliberately, for example by a developer who didn’t want to expose low-value or duplicate pages.

Fix URLs that don’t point to a real page

Sometimes the problem is caused by a URL that doesn’t actually lead to a page. If it should be a page with important content that you want your users to see, change the URL. This is possible on a CMS like WordPress, where you can change the slug of a page. If the page isn’t important, or if the URL is just a search query from your blog, there’s no need to fix the GSC error.

Reasons why pages are not indexed

Here are the main reasons why pages are not indexed when they should be:

  • You may have put directives in your robots.txt file that block pages that should be indexed, such as tags and categories. Tag and category pages are real URLs on your site, yet a rule like the following keeps crawlers away from them:
User-agent: *
Disallow: /tags/
  • Googlebot has limited time to follow all the links it finds. If a URL goes through a long chain of redirects, or the page no longer exists, Googlebot may stop before it ever reaches the content.
  • The canonical link is not set up properly. A canonical tag in the HTML head tells Googlebot which page is the preferred, canonical version when the same content exists on two different web pages. Make sure every page that should be indexed carries a correct canonical tag (see the example after this list).
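For reference, a canonical tag is a single line inside the page’s head; the URL below is only a placeholder:

<link rel="canonical" href="https://example.com/preferred-page/">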

Reasons why pages are indexed

Pages that should not be indexed sometimes end up indexed anyway. Here are a few common causes:

A page with a “noindex” directive

Search engine bots must be able to crawl a page to see its noindex directive. If robots.txt blocks the page, Googlebot never reads the noindex tag, so the page can still end up indexed when other websites link to it. In that case, only the URL and the anchor text are displayed in search engine results.

What to do:

Check a few things in your robots.txt file to make sure it is configured properly:

  • Check that there is only one ‘User-agent’ block, so rules don’t conflict.
  • Run your robots.txt file through a text editor that reveals invisible Unicode characters (such as a stray byte-order mark) and remove any special characters it finds.
  • If the goal is to keep pages private, password-protecting the files on your server is a more reliable way to keep them out of search than blocking them in robots.txt.
  • Remove the pages from robots.txt, or block them from indexing with the following meta tag: <meta name="robots" content="noindex"> (see the sketch after this list).
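As a minimal sketch, this is where such a noindex tag would sit on a hypothetical page:

<head>
  <meta name="robots" content="noindex">
  <title>Page you do not want in search results</title>
</head>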

Outdated URLs

If you published new content or a new site and added a ‘noindex’ directive or robots.txt block to keep it from being indexed, old versions of the URLs may still sit in Google’s index. If you have just signed up for GSC, there are a couple of ways to resolve the blocked-by-robots.txt issue:

What to do:

  • Allow Google to drop the old URLs from its index over time
  • Use 301 redirects from the old URLs to the new ones (see the sketch after this list)
  • Avoid relying on plugins for bulk 404 redirections; if such plugins cause problems, GSC may send you the ‘blocked by robots.txt’ notice.
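As a rough sketch, a single 301 redirect in an Apache .htaccess file could look like this; the paths and the domain are placeholders:

# Permanently redirect the old URL to its replacement
Redirect 301 /old-page/ https://example.com/new-page/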

Crawling disabled for the entire website

If you only want to block one directory, follow the directory name with a forward slash — a bare “Disallow: /” disables crawling of the entire website. Also keep in mind that robots.txt can be accessed by anybody: listing confidential directories in it reveals where your private content lives, and that material may still end up crawled and indexed.
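A minimal sketch of the difference, using a hypothetical /private/ directory (these are two alternative robots.txt files, not one file):

# Blocks crawling of the entire site
User-agent: *
Disallow: /

# Blocks crawling of the /private/ directory only
User-agent: *
Disallow: /private/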

Virtual robots.txt files

You may still receive these alerts even if you don’t have a robots.txt file of your own. Content management system (CMS) sites, such as WordPress, generate virtual robots.txt files that can keep search engines out, and plug-ins may add robots.txt rules as well.

What to do:

You need to replace these virtual robots.txt files with an actual robots.txt file of your own. In it, include a directive stating that all search engine bots may explore your site; it’s the only way they can properly control whether your URLs get indexed or not.

Here is the code you can use to let bots crawl your site:

User-agent: *
Disallow:

An empty Disallow value means ‘disallow nothing,’ so every bot is allowed to crawl every URL.

A solution for “Indexed, though blocked by robots.txt”

  • Using Google Search Console, export a list of URLs and organize them alphabetically.
  • Go through the URLs and determine which of the following groups each one falls into:
  • For URLs you want to be indexed: amend your robots.txt file, if necessary, so that Google is allowed to visit them.
  • For URLs you don’t want to be crawled by search engines: keep the robots.txt rule, but check whether you have internal links pointing to them that should be removed.
  • For URLs you don’t want to be indexed: remove the block from your robots.txt and apply robots noindex directives instead.
  • If you’re unsure which part of your robots.txt is blocking a URL, select the URL and click the TEST ROBOTS.TXT BLOCKING button in the panel on the right. A new window opens showing exactly which line of your robots.txt blocks Google from reaching that URL.
  • To have Google re-evaluate your robots.txt against your URLs, click the VALIDATE FIX button when you’ve finished making the necessary adjustments.

Be aware of user-agent blocks

A user-agent block means the site is blocking a specific user agent, such as Googlebot, usually because it has wrongly identified it as a spam bot.

If you can see a page normally in your default browser but not after switching to a different user agent, the user agent you switched to is being blocked.
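One quick way to test this is with curl, sending the same request with different User-Agent headers; a sketch using a placeholder URL:

# Fetch the response headers as a generic browser
curl -I -A "Mozilla/5.0" https://example.com/some-page/

# Fetch the same headers while identifying as Googlebot
curl -I -A "Googlebot/2.1 (+http://www.google.com/bot.html)" https://example.com/some-page/

If the second request returns a 403 or otherwise behaves differently, the Googlebot user agent is being blocked.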

What to do:

The solution depends on where you discover the block. Bots can be blocked by various systems, including .htaccess, server configurations, firewalls, CDNs, or even something your hosting company controls that you can’t see. To find out where the problem originates and how to fix it, you should speak with your hosting company or CDN.

A user agent can be blocked in .htaccess in various ways, such as in the following examples.

RewriteEngine On
# Return 403 Forbidden for any request whose user agent contains "Googlebot"
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule .* - [F,L]

Or…

# Apache 2.2-style rules: flag requests from Googlebot, then deny them
BrowserMatchNoCase "Googlebot" bots
Order Allow,Deny
Allow from ALL
Deny from env=bots

Make sure an IP block isn’t the problem

If you’ve ruled out robots.txt and user-agent blocks, then an IP block is most likely the culprit.

What to do:

Finding the source of IP restrictions is a tricky task. First, contact your web host or CDN to ask where the block originates and how you can fix it, much like with user-agent blocks.

You may be searching for something like this in .htaccess:

# Blocks all requests from this IP address
deny from 123.123.123.123

How to fix indexed though blocked by robots.txt in WordPress

For WordPress sites, addressing this issue is similar to the processes above, but here are a few hints to help you locate your robots.txt file quickly:

See whether you have a robots.txt file

GSC may also give you these messages even if you do not have a robots.txt file of your own: CMSs like WordPress generate a virtual robots.txt file, and plugins may add robots.txt rules too. If these virtual files are the source of the GSC complications, overwrite them with a physical robots.txt file of your own, as described above.

WordPress search engine visibility

If the problem affects your entire website, the most likely cause is that you disabled indexing in WordPress. This mistake is most common with new websites and website migrations. To check whether this is the case, follow these steps (or see the command-line sketch after them):

  1. In your WordPress admin, click ‘Settings.’
  2. Click ‘Reading.’
  3. Under ‘Search Engine Visibility,’ remove the check from ‘Discourage search engines from indexing this site.’
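If you manage the site from the command line, WP-CLI exposes the same setting through the blog_public option; a minimal sketch, assuming WP-CLI is installed:

# 0 means "discourage search engines"; 1 means the site is visible to them
wp option get blog_public
wp option update blog_public 1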

How to fix indexed though blocked by robots.txt with Yoast SEO

If you use the Yoast SEO plugin, this is how to make changes to your robots.txt file:

  1. Log in to your wp-admin area.
  2. In the sidebar, go to the Yoast SEO plugin’s menu and select Tools.
  3. Open the File editor window to begin editing your robots.txt file.

How to fix indexed though blocked by robots.txt with Rank Math

Use the instructions below to modify your robots.txt file if you are using the Rank Math SEO plugin:

  1. Log in to your WordPress admin area.
  2. Go to Rank Math > General Settings in the sidebar.
  3. Go to Edit robots.txt.

How to fix indexed though blocked by robots.txt with All in One SEO

To make changes to your robots.txt file while using the All in One SEO plugin, follow these instructions:

  1. Log in to your WordPress admin area.
  2. In the sidebar, navigate to All in One SEO > Robots.txt to see the robots.txt file.

How to fix indexed though blocked by robots.txt via FTP

This technique requires an FTP client to connect to your server. Log in with your site’s FTP credentials in a client such as FileZilla. Once you’ve established a connection with your server, follow these steps:

  1. Locate the robots.txt file in your site’s root directory on the server.
  2. Using a simple text editor such as Notepad on Windows or TextEdit on Mac, remove the disallow rules for the impacted URLs from the file.
  3. If you do not currently have a robots.txt file, create one in Notepad or TextEdit and name it robots.txt.
  4. Save the file as plain text, keeping the name robots.txt.
  5. Upload the new file to the server, overwriting the old robots.txt file (see the sketch after these steps).
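As a sketch, if the blocked URLs sat under a hypothetical /tags/ path, removing the offending rule would change the file like this:

# Before: /tags/ pages are blocked from crawling
User-agent: *
Disallow: /tags/

# After: the rule is removed, so nothing is disallowed
User-agent: *
Disallow: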

Using Google Search Console to validate your fixes

After the robots.txt file has been successfully modified, you can inform Google.

Go to the “Details” section and click on the warning.

There is only one thing you need to do from here: click on “Validate Fix.”

Google will then re-crawl the URLs, identify any noindex directives, and remove the pages in question from its index. As a result, you’re now on your way to an SEO-friendly, error-free website.

Conclusion

We’ve covered what the “Indexed, though blocked by robots.txt” warning means, why it is shown, how to find the affected URLs, and the possible solutions. The warning does not necessarily mean there is a problem on your website, but if you ignore it, your most important pages may not get indexed, which is bad for the user experience.