Every WordPress site over two years old has the same quiet problem: hundreds of internal links pointing at redirected URLs, dozens of orphan pages nobody linked to, and a ?replytocom= parameter explosion in the access logs. You can’t see any of it from the front end. Googlebot can. So can GPTBot, ClaudeBot, and a dozen other AI crawlers now competing for the same server capacity.

This is where the Crawl Waste Audit comes in — a step-by-step cleanup that fixes the things actually wasting your crawl budget today. Run the steps in order if you’re starting from scratch. Skip to the ones you haven’t done if you’re partway through.

What Is Crawl Budget?

Crawl budget is the number of URLs a search engine is willing to fetch from your site in a given time window. It’s the product of two things Google watches simultaneously: how much your server can handle (crawl capacity) and how much Google actually wants to crawl your URLs (crawl demand). (Google Search Central)

A delivery driver analogy gets you 80% of the way there: a fixed shift, a list of stops, a manager prioritizing the route. But it misses the part that matters most now — your shift has to accommodate a fleet of AI bots all asking for the same packages.

Here’s the part Google’s docs are blunt about and most articles soften:

  • Crawling is per hostname, not per domain. blog.yoursite.com and shop.yoursite.com get separate budgets.
  • Crawling isn’t a ranking factor. “Improving your crawl rate won’t necessarily lead to better positions in Search results.” (Google)
  • Crawl ≠ index. Plenty of crawled URLs never make it into the index. That gap is usually quality, not budget.

Why Crawl Budget Matters (And When It Doesn’t)

Most WordPress sites don’t have a crawl budget problem. John Mueller has said this on record: probably over 90% of sites don’t need to worry about it. Martin Splitt called it “a problem that is rare to be had.” (Search Engine Journal)

So before you spend a weekend optimizing, here’s the honest test:

  • 500-post personal blog, weekly publishing: No. Focus on content + internal linking.
  • 2,000-page business site, no faceted nav: Probably not directly — but the cleanup helps.
  • WooCommerce store with filter URLs: Yes. Faceted nav is the #1 crawl waster globally.
  • 10,000+ pages with daily updates: Yes. Officially over Google’s threshold. (Google docs)
  • Major “Discovered – currently not indexed” gaps in GSC: Diagnose first — usually quality, not budget.

That said, the side effects of a crawl waste audit — fewer soft 404s, no redirect chains, no orphan pages, faster TTFB, clean robots.txt — improve every site, regardless of size. The audit pays for itself even when crawl budget isn’t your bottleneck.

The opinionated stance I’m going to defend in this article: most internal linking advice is anchor-text astrology. What actually moves the needle on crawl budget is fixing the boring infrastructure — orphan pages, redirect chains, and the AI bots quietly hammering your archive pages overnight.

The Modern Crawl Reality (What’s Actually Going On Out There)

A lot of crawl budget advice still circulating was written for a different web. Three forces have rewritten the rules since then.

Google Wants to Crawl Less

Gary Illyes has been unusually direct about this on LinkedIn: his stated goal is to “figure out how to crawl the web even less.” Google’s “Crawling December” series reads like a coordinated plea for proper HTTP caching. The headline stat: only 0.017% of total Googlebot fetches are cacheable today, down from 0.026% a decade earlier. (Google Search Central Blog) Site owners are getting worse at letting Google save crawl effort, not better.

The Crawl Rate Limiter Is Gone

The old Search Console tool that let you throttle Googlebot is retired. If you genuinely need to slow Googlebot now, your only options are returning 500/503/429 status codes or filing a special request through the Googlebot Report form.

AI Crawlers Are a Real Force on Your Server

This is the big one. Cloudflare reports AI bots at roughly 4.2% of all HTML requests across their network — nearly matching Googlebot’s 4.5%. GPTBot’s share of AI crawler traffic has grown from roughly 5% to 30%. (Cloudflare)

Vercel’s monthly fetch data puts numbers on it:

Monthly bot fetch volume, based on Vercel’s network-wide fetch data (AI crawlers combined equal roughly 28% of Googlebot’s volume):

  • Googlebot (search): 4.5B fetches per month
  • GPTBot (AI training): 569M fetches per month
  • ClaudeBot (AI training): 370M fetches per month
  • AppleBot (AI training): 314M fetches per month

Why this matters for crawl budget: every AI bot request hits the same server Googlebot is trying to use. When your TTFB rises because GPTBot is hammering archive pages overnight, Googlebot interprets that as “this site is slow” and crawls less. The two crawl economies are linked.

And one more thing that breaks a lot of assumptions: none of the major AI crawlers — GPTBot, ClaudeBot, PerplexityBot, AppleBot — render JavaScript. (Vercel) If your content depends on client-side rendering, you’re invisible to most AI search engines.
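
A quick way to test your own exposure is to fetch a page the way these bots do (raw HTML, no JavaScript execution) and check whether a phrase from your important content is actually present in the response. A minimal Python sketch; the URL and the phrase are placeholders for your own page:

import requests

def visible_without_js(url: str, key_phrase: str) -> bool:
    """Fetch the raw HTML without executing JavaScript and check
    whether the phrase appears in the server-rendered markup."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return key_phrase.lower() in resp.text.lower()

# Placeholder URL and phrase: pick a sentence that only shows up
# once your content is fully rendered.
print(visible_without_js("https://example.com/pricing/", "monthly plan"))

If this returns False for content you can see in a browser, that content is being injected client-side and most AI crawlers will never see it.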

What You’ll Need

A few things before we start:

  • WordPress 5.8 or higher — most managed hosts run 6.x by default
  • Access to Google Search Console — Crawl Stats lives at Settings → Crawl Stats
  • Server log access — every managed host gives you raw logs (WP Engine User Portal → Access Logs, MyKinsta → Logs, Cloudways Server Management → Monitoring, SiteGround Site Tools → Statistics)
  • Linkilo — the free engines handle most of this audit (Crawl Log Analyzer, Site Scanner, 404 Monitor, Orphan Page Finder, Anchor Text Analysis are all free without an OpenAI key). The AI engine adds semantic link suggestions for the internal-linking step.

How to Audit and Fix Your Crawl Waste

Each step below is independent. If you’ve already configured your robots.txt for AI bots, skip step 3. If you’ve never run a Site Scanner, step 6 is probably your highest-impact win. Read the headers, find the gaps in your own setup, and run those.

1. Audit Crawl Activity With the Crawl Log Analyzer

The first question every audit should answer: what is Googlebot actually crawling on your site, and how much of it is waste?

GSC’s Crawl Stats report gives you a 90-day aggregate, which is fine for trending but useless for diagnosis. You can’t filter by URL pattern. You can’t see which AI bots are visiting. You can’t tell whether the 1,400 daily Googlebot hits are going to your money pages or to /?replytocom= URLs.

If you want to use GSC:

  • Open Google Search Console.
  • Go to Settings → Crawl Stats and click “Open Report.”
  • Note the average number of pages crawled per day.
  • Divide your total number of indexable pages by that daily average.
  • If the result is greater than 10, you have at least ten times more pages than Google crawls in a day, and crawl budget deserves attention. If it’s under 3, crawl budget almost certainly isn’t your bottleneck and you don’t need to worry about it.
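
If you’d rather script that arithmetic, the check is a few lines of Python. The numbers below are placeholders; pull your own from the Crawl Stats report:

# Placeholder inputs: replace with your own site's numbers.
indexable_pages = 4200        # pages you actually want crawled and indexed
avg_crawled_per_day = 350     # "average crawled per day" from GSC Crawl Stats

ratio = indexable_pages / avg_crawled_per_day
print(f"Pages per daily crawl: {ratio:.1f}")

if ratio > 10:
    print("Crawl budget deserves attention.")
elif ratio < 3:
    print("Crawl budget is almost certainly not your bottleneck.")
else:
    print("Borderline: finish the audit and re-check.")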

What affects the crawl budget?

Alternate versions of your URLs (AMP, hreflang) and embedded resources like CSS, JavaScript, and XHR requests all count toward a site’s crawl. In other words, crawl budget is consumed by every URL and request Googlebot fetches, not just your HTML pages.


Google discovers these URLs by crawling and parsing pages, and through sitemaps, RSS feeds, URL submissions in Google Search Console, and the Indexing API. Multiple Google bots also share the same crawl budget; the Crawl Stats report in GSC lists which of them are crawling your site.


Here’s how to do it:

  1. Open Linkilo → Crawl Logs in your WordPress admin.
  2. Click the Overview tab. Note your Health Score (0–100) — the composite of crawl frequency, page coverage, error rate, and response time, equally weighted.
  3. Click the Crawl Waste tab to see the 16-bucket Crawl Waste taxonomy: hard waste (404s, soft 404s, redirect chains), probable waste (faceted nav, action parameters, feeds), and configuration waste (admin URLs, login redirects, plugin-generated phantom URLs).
  4. Click the Bot Behavior tab to see your traffic split between the 40+ bots Linkilo tracks — Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot, Google-Extended, AppleBot, Bytespider, Meta-ExternalAgent, and so on. Each AI bot is tagged by intent: live answer, training, or user-pasted link.

What you’re looking for, in this order:

  • Hard waste (4xx/5xx) above 5% of total Googlebot requests = drop everything and fix this first
  • Probable waste above 25% = your site has parameter or faceted nav bloat
  • AI bot traffic exceeding Googlebot = your server is being trained on, possibly without you knowing
  • Health Score below 70 = there’s a structural issue we’ll find in later steps

Sidenote. Linkilo’s crawl logging only captures requests that hit PHP. Static assets served directly by Nginx or Apache (like /wp-content/uploads/image.jpg) won’t appear here. For a full server-level log audit, pair this with raw access log analysis from your host. For 95% of crawl budget questions, the PHP layer is what matters anyway.
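
If you do pull raw access logs from your host, a first pass can be as simple as counting requests per bot. A minimal Python sketch assuming a standard combined-format Apache/Nginx log; the filename and the bot list are assumptions you’d adjust:

import re
from collections import Counter

# Bots worth counting; extend as needed.
BOTS = ["Googlebot", "Bingbot", "GPTBot", "ClaudeBot", "PerplexityBot", "Applebot", "Bytespider"]

# In combined log format, the user agent is the last quoted field on the line.
ua_pattern = re.compile(r'"([^"]*)"\s*$')
counts = Counter()

with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = ua_pattern.search(line)
        if not match:
            continue
        ua = match.group(1).lower()
        for bot in BOTS:
            if bot.lower() in ua:
                counts[bot] += 1
                break

for bot, hits in counts.most_common():
    print(f"{bot:>15}: {hits} requests")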

2. Diagnose Bot Behavior

Open the Bot Behavior tab and sort by request count. Then ask three questions:

Is GPTBot or ClaudeBot crawling more than Googlebot? That’s increasingly common. Anthropic’s ClaudeBot has been observed crawling 38,000 pages for every single referral visit it sends. (Cloudflare) If that ratio is killing your server, you have a decision to make about training opt-out (more on this in the next step).

Is Bingbot under-crawling you? Kinsta’s analysis of 13B requests found Bingbot accounts for 36% of bot traffic on WordPress sites — surprisingly high. If your Bingbot share is well under that, your sitemap submission to Bing Webmaster Tools may be incomplete.

Are unverified bots (failing reverse-DNS) showing up as Googlebot? Linkilo verifies bot identity via reverse-DNS lookup automatically. Anything claiming to be Googlebot that fails verification is a scraper — and it doesn’t count against your real crawl budget.
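
If you ever want to spot-check a suspicious log line yourself, the verification is a reverse-DNS lookup followed by a forward confirmation. A minimal Python sketch using the standard socket module; the sample IP is just an illustration:

import socket

def is_verified_googlebot(ip: str) -> bool:
    """Reverse-DNS the IP, require a googlebot.com/google.com hostname,
    then forward-resolve that hostname and confirm it maps back to the IP."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
    return ip in forward_ips

# Illustrative IP pulled from a log line that claims to be Googlebot.
print(is_verified_googlebot("66.249.66.1"))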

3. Update Your Robots.txt for the AI Era

Once you’ve seen who’s actually crawling you, the robots.txt decision becomes tactical instead of theoretical.

There are two AI bot categories most articles conflate:

  • Training bots — GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Meta-ExternalAgent, CCBot, Bytespider, PerplexityBot, anthropic-ai, cohere-ai. These crawl your content to feed training datasets. Blocking them is opt-out from AI training.
  • User-triggered/search bots — OAI-SearchBot, ChatGPT-User, Claude-SearchBot, Claude-User, Perplexity-User. These fetch pages when a user asks a live question. Blocking them = invisible in AI answer engines.

Most publishers now block training and allow search. Here’s the template I run on my own sites:

User-agent: *
Disallow: /wp-admin/
Disallow: /?s=
Disallow: /search/
Disallow: /*?replytocom=
Disallow: /*?add-to-cart=
Allow: /wp-admin/admin-ajax.php

# Block AI training crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
User-agent: CCBot
User-agent: Bytespider
User-agent: cohere-ai
User-agent: PerplexityBot
Disallow: /

# Allow user-triggered AI fetches (for visibility in AI answer engines)
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: Perplexity-User
Allow: /

Sitemap: https://example.com/sitemap_index.xml

A few notes:

  • Don’t block /wp-content/ or /wp-includes/. Googlebot needs CSS and JS to render pages, and blocking these breaks rendering.
  • /wp-json/ is fine to leave open unless you have specific reasons to block it.
  • If you’re on a managed host with staging environments, double-check that staging has Disallow: / and production doesn’t. This is the #1 traffic-killing migration mistake I’ve seen.

After deploying, return to Linkilo → Crawl Logs → Bot Behavior in 48 hours. The training bots you blocked should drop to zero verified hits. (Some scrapers will pretend to be GPTBot and ignore robots.txt — Linkilo’s verification flags these so you can block them at the firewall.)
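
You can also sanity-check the deployed file before the logs catch up, using Python’s built-in robots.txt parser. Note that urllib’s parser applies rules in file order rather than Google’s most-specific-rule logic, so treat this as a check on the bot groups, not a full Googlebot simulation. The domain and path are placeholders:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://example.com/robots.txt")  # placeholder domain
parser.read()

page = "https://example.com/sample-post/"  # placeholder path
checks = [
    ("GPTBot", False),         # training bots should be blocked
    ("ClaudeBot", False),
    ("Bytespider", False),
    ("OAI-SearchBot", True),   # user-triggered bots should stay allowed
    ("ChatGPT-User", True),
]

for agent, expected in checks:
    allowed = parser.can_fetch(agent, page)
    status = "OK" if allowed == expected else "MISMATCH"
    print(f"{status}: {agent} allowed={allowed}")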

Recommended reading: [link to: WordPress robots.txt complete guide]

4. Kill the Faceted Nav and Action Parameters

In a recent Search Off the Record episode, Gary Illyes broke down where 75% of Google’s crawling problems come from:

  • Faceted navigation (50%): filter and sort URLs (e.g. ?orderby=, ?filter_color=)
  • Action URLs (25%): add-to-cart, login, comment-reply links
  • Irrelevant query parameters (10%): tracking and utm_ parameters Google ignores
  • Buggy WordPress plugins (5%): plugins generating phantom URLs
  • Other (2%): edge cases and obscure patterns

Three of those five buckets — action URLs, irrelevant parameters, plugin bugs — are pure WordPress problems. Here’s how to find and fix them:

  1. Go back to Linkilo → Crawl Logs → Crawl Waste.
  2. Filter the table by the Probable Waste tier.
  3. Sort by total Googlebot requests, descending.
  4. The top 10–20 rows are your specific waste patterns — ?orderby=, ?filter_color=, ?replytocom=, ?add-to-cart=, ?utm_source=, etc.
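
If you’re working from a raw URL export instead of the plugin, the same waste patterns fall out of a simple parameter frequency count. A minimal Python sketch; the input file of crawled URLs is an assumption:

from collections import Counter
from urllib.parse import urlsplit, parse_qsl

param_hits = Counter()

# Assumed input: one crawled URL per line, exported from your logs.
with open("crawled-urls.txt", encoding="utf-8") as urls:
    for line in urls:
        query = urlsplit(line.strip()).query
        for param, _value in parse_qsl(query, keep_blank_values=True):
            param_hits[param] += 1

for param, hits in param_hits.most_common(20):
    print(f"?{param}= seen on {hits} crawled URLs")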

The fixes, in order of preference:

  • Block in robots.txt for parameters that should never be crawled (?add-to-cart=, ?replytocom=, tracking params)
  • Canonical to clean URL for filter/sort URLs you want users to access but Google to consolidate
  • For ?replytocom: Yoast strips this by default. AIOSEO has Crawl Cleanup. Verify yours is on.
  • For WooCommerce filters: use a plugin like Yoast WooCommerce SEO or Rank Math’s WooCommerce module to handle filter parameter SEO automatically

5. Hunt Soft 404s

These are pages returning a 200 OK status but with no real content — empty category archives, expired event listings, stale tag pages with three thin posts. Glenn Gabe quoted Illyes on this neatly: with soft 404s, you aren’t adding anything to the index and you’re wasting crawl budget.

GSC flags these in the Pages report under “Soft 404.” But it doesn’t show which template is generating them, and on WordPress that’s almost always the answer.

In Linkilo → Crawl Logs → Health, filter response codes by 200 and look at average page size. Pages under 1KB of HTML output are almost certainly soft 404s. Common culprits on WordPress:

  • Empty tag archives (turn off ones with fewer than 5 posts)
  • Author archives on single-author sites (redirect to homepage)
  • Date archives (typically safe to disable entirely)
  • Search result pages (?s=) — should be noindexed and robots-blocked
  • WooCommerce categories with no products
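
For a scripted version of the same check, fetch a sample of suspect archive URLs and flag 200 responses with a tiny HTML payload. A minimal Python sketch; the URLs are placeholders and the 1KB threshold mirrors the rule of thumb above:

import requests

# Placeholder URLs: archive, tag, and category pages you suspect are thin.
suspect_urls = [
    "https://example.com/tag/misc/",
    "https://example.com/2019/03/",
]

for url in suspect_urls:
    resp = requests.get(url, timeout=10)
    size = len(resp.content)
    if resp.status_code == 200 and size < 1024:
        print(f"LIKELY SOFT 404: {url} ({size} bytes)")
    else:
        print(f"ok: {url} ({resp.status_code}, {size} bytes)")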

6. Fix Internal Links Pointing at Redirects

This is the step that surprises people the most. On a five-year-old WordPress site, it’s not unusual to find 200–400 internal links pointing at URLs that 301 to somewhere else. Each one is a wasted Googlebot request — Googlebot fetches the original URL, follows the redirect, then fetches the destination. Two requests for one page. Multiply by hundreds of links, and you’ve quietly tanked your crawl efficiency.

How internal links to redirects waste crawl budget:

  • Wasted crawl (2 requests): the internal link points at /old-post-url; Googlebot fetches it (request 1), gets a 301 Moved, then fetches /new-post-url (request 2) before reaching the 200 OK page.
  • Direct link, fixed (1 request): the internal link points straight at /new-post-url, and a single request returns 200 OK.

At scale on a typical 5-year-old WordPress site:

  • 312 internal links pointing at redirects (found in 2 minutes)
  • 624 Googlebot requests wasted per crawl cycle
  • 50% crawl efficiency reduction on affected paths

SearchPilot’s case study work found that pointing internal links directly to the final 200-status URL — rather than letting them redirect — measurably improved crawl efficiency.

The standard manual process: crawl your site with Screaming Frog, filter for “Inlinks to Redirect,” open each post, find the link, replace the URL, save. For a site with 200 affected links, that’s roughly a full day of clicking.

Here’s the faster version:

  1. Open Linkilo → Redirection → Site Scanner.
  2. Click Scan Entire Site. The scanner reads <a href> tags inside post_content across every published post and page, cross-referencing them against your active redirects.
  3. Review the results table — every internal link pointing at a redirected URL is listed with source post, target URL, and the destination it should be pointing to.
  4. Either fix individually (per link) or click Fix All to bulk-replace every detected link with its final destination URL. There’s a 24-hour undo window in case you change your mind.

Sidenote. The Site Scanner reads <a href> tags in post_content only. Programmatic links inside widgets, menus, and theme files aren’t audited — those need to be edited at the theme/widget level. For 90% of WordPress sites, post content is where the link debt lives.

This single step is often the highest-impact fix in the whole audit. On a site I ran this on recently — 1,800 published posts, six years of accumulated redirect debt — Site Scanner found 312 internal links pointing at redirected URLs in under two minutes.
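
If you want to reproduce the detection outside any tool, the core logic is small: collect internal hrefs from a page, request each one without following redirects, and flag anything that answers with a 3xx. A minimal Python sketch over a single post; the source URL is a placeholder and scaling it across every post is left to your crawler of choice:

import requests
from urllib.parse import urljoin, urlsplit
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects every <a href> value from a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

page = "https://example.com/some-post/"   # placeholder source post
host = urlsplit(page).netloc

collector = LinkCollector()
collector.feed(requests.get(page, timeout=10).text)

for href in collector.hrefs:
    url = urljoin(page, href)
    if urlsplit(url).netloc != host:
        continue  # external link; not part of this audit
    resp = requests.head(url, allow_redirects=False, timeout=10)
    if 300 <= resp.status_code < 400:
        print(f"{url} -> {resp.status_code} -> {resp.headers.get('Location')}")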

7. Fix Orphan Pages

Botify’s analysis of enterprise sites found that pages with zero inbound internal links — orphan pages — consume an average of 26% of crawl budget. On unoptimized sites they’ve seen orphan ratios above 70%.

JetOctopus tracks the same pattern from a different angle: pages at click depth >4 from the homepage get crawled in 50% of cases or less. Mueller has been blunt about this in office hours: “It’s really, from the homepage or from the primary page, how quickly can we reach that specific page… as it moves away from the home page we’ll think probably this is less critical.”

Here’s how to find and fix orphans:

  1. Open Linkilo → Reports → Orphan Pages.
  2. The report lists every published post or page with zero inbound internal links from other posts on your site.
  3. For each orphan, decide: is this content I want indexed (add internal links) or content I don’t (noindex or delete)?
  4. For ones worth keeping, click into each post. In the editor, scroll to the AI Link Suggestions metabox.
  5. Click Get AI Suggestions. Linkilo returns 5–15 internal link opportunities ranked by composite score: 0.75 × post similarity + 0.15 × keyword overlap + 0.10 × cluster signal.

The composite score is the part that matters. A keyword-only matcher will suggest linking from a post about “WordPress hosting” to any other post mentioning “hosting” — even if the topical context is different. The semantic similarity weighting (0.75) is what catches the editorially correct link targets the keyword-only engine misses.
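
Under the hood that score is just a weighted sum, which is easy to sanity-check by hand. A minimal Python sketch with made-up component values to show why the semantic weight dominates the ranking:

def composite_score(similarity: float, keyword_overlap: float, cluster_signal: float) -> float:
    """Weighted link-suggestion score: semantic similarity dominates."""
    return 0.75 * similarity + 0.15 * keyword_overlap + 0.10 * cluster_signal

# Hypothetical candidates for an orphaned post about "email list growth":
candidates = {
    "newsletter-signup-guide": composite_score(0.92, 0.10, 0.60),   # strong topical match, weak keyword overlap
    "wordpress-hosting-review": composite_score(0.20, 0.80, 0.10),  # keyword overlap only
}

for slug, score in sorted(candidates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{slug}: {score:.2f}")

The hosting-review candidate wins on keyword overlap but loses overall, which is exactly the behavior you want from the weighting.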

The free engine handles step 5 too — it does keyword matching against post titles and returns a smaller set of suggestions. For a quick orphan-rescue pass on a 500-post site, it’s usually sufficient. Where the AI engine pulls ahead is on synonym variations: a post about “newsletter signups” linking to a post about “email list growth,” for instance, where the surface keywords don’t overlap but the topical match is obvious.

8. Audit Anchor Text Diversity

While you’re in the internal-linking phase, run an anchor text audit. Over-optimized exact-match anchors look spammy to Google’s algorithms; thin or generic anchors (“click here,” “read more”) waste the crawl-prioritization signal Google’s Anchor Tag Indexing patent describes.

Linkilo’s 4-Phase Anchor Engine is the methodology to follow:

  1. Phase 1 — Exact keyword match (e.g., “internal linking strategy”)
  2. Phase 2 — Partial keyword match (e.g., “linking strategy”)
  3. Phase 3 — Title phrase (e.g., “guide to internal links”)
  4. Phase 4 — Salient word (a topically meaningful word from the source post)

The engine never invents anchors — every suggested anchor must exist verbatim in the source post. That sounds restrictive but it’s the right constraint: invented anchors that don’t appear in the source paragraph create unnatural link insertions Google can detect.

To audit your existing anchors:

  1. Open Linkilo → Reports → Anchor Text Analysis.
  2. Note your anchor diversity score — unique anchors divided by total internal links. Below 0.4 means you’re over-using a small set of anchors.
  3. Drill down by anchor to find ones used 50+ times. Those are candidates for rewrites.
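
The diversity score itself is simple enough to recompute from any export of your internal link anchors. A minimal Python sketch; the anchor list is a stand-in for your own data:

from collections import Counter

# Stand-in for an export of every internal link's anchor text.
anchors = ["internal linking strategy"] * 60 + ["read more"] * 25 + ["crawl budget guide"] * 15

counts = Counter(anchors)
diversity = len(counts) / len(anchors)

print(f"Anchor diversity score: {diversity:.2f}")  # unique anchors / total internal links
for anchor, uses in counts.most_common():
    if uses >= 50:
        print(f"Rewrite candidate: '{anchor}' used {uses} times")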

9. Sort Out Performance and HTTP Caching

This is the underrated lever almost nobody implements properly. When Googlebot requests a page it has visited before, it can send an If-Modified-Since header. If your server responds with 304 Not Modified, Googlebot doesn’t re-download the content — it just notes the page is still valid. Massive crawl-time savings.

Google’s data shows only 0.017% of total fetches are cacheable today. (Google) They’re effectively asking site owners to enable this and being ignored.

For WordPress:

  • Cloudflare APO handles this at the CDN layer automatically — Cloudflare’s benchmarks show TTFB drops of 70%+ on WordPress sites
  • WP Rocket has cache-control header configuration in its advanced settings
  • Server-level: Nginx/Apache Last-Modified and ETag headers on dynamically generated pages
  • Managed hosts (Kinsta, WP Engine, Cloudways) usually handle this for you, but verify with httpstatus.io
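
To verify conditional-request support yourself rather than through a web tool, send an If-Modified-Since request and see whether the server answers 304. A minimal Python sketch; the URL is a placeholder, and the check only means something if the first response carries a Last-Modified header:

import requests

url = "https://example.com/sample-post/"  # placeholder

first = requests.get(url, timeout=10)
last_modified = first.headers.get("Last-Modified")
print(f"First fetch: {first.status_code}, Last-Modified: {last_modified}")

if last_modified:
    second = requests.get(url, headers={"If-Modified-Since": last_modified}, timeout=10)
    print(f"Conditional fetch: {second.status_code}")  # 304 means the re-download was skipped
else:
    print("No Last-Modified header: conditional requests can't save this crawl.")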

Mueller’s stated TTFB target is around 300–400ms. Above that, Googlebot’s adaptive crawl rate kicks in and crawls less. Verify yours in Linkilo → Crawl Logs → Response Times — Linkilo tracks the average response time Googlebot specifically experienced on your site, which is more accurate than the synthetic benchmarks most speed tools give you.

10. Verify and Document

The audit isn’t done until you’ve measured the before/after. Here’s the verification pass:

  1. Open Linkilo → Crawl Logs → Overview. Note your new Health Score.
  2. Open GSC → Settings → Crawl Stats. Compare the latest Total Crawl Requests against your baseline from step 1.
  3. Run Linkilo → Reports → Summary Reports → Generate Snapshot. This auto-saves a weekly snapshot for ongoing comparison.
  4. Set Linkilo → Crawl Logs → Alerts to email you when waste percentage exceeds 25% again, so you don’t have to remember to re-audit manually.

Is It Really That Simple?

Mostly, yes. But two things deserve the asterisk treatment.

First, crawl budget is symptomatic. If your Health Score is 60 and your soft 404 count is 40 and your orphan ratio is 30%, fixing those things will move the score and clean up your logs — but the underlying reason those problems accumulated probably wasn’t a crawl budget oversight. It was that nobody owned site hygiene for two years. The audit fixes the artifacts. The discipline of running it on a schedule fixes the cause.

Second, internal linking has a ceiling. I said earlier that most internal linking advice is anchor-text astrology, and I’ll defend that. The honest version: well-structured internal linking gets you to maybe 80% of the crawl efficiency gain available on a typical WordPress site. The remaining 20% comes from things internal linking can’t fix — slow TTFB, faceted nav explosion, training bots eating your server, AI crawlers that don’t render JavaScript. If you’ve optimized internal linking and you’re still seeing crawl waste over 25%, the bottleneck is elsewhere.

The Crawl Waste Audit handles both — that’s why it’s ten steps and not three.

One more honest limit: Linkilo’s Site Scanner reads internal links inside post_content only. Programmatic links injected by widgets, menus, custom theme functions, or page builders that store layout in custom tables (some Elementor and Divi configurations) aren’t visible to the scanner. For most WordPress sites running standard themes and Gutenberg, that covers the vast majority of internal linking. For complex page-builder sites, you’ll want to manually audit the layout layer separately.

Common Misconceptions

  • “Adding noindex saves crawl budget”: No. Google still crawls the page to see the noindex tag.
  • “Disallowing in robots.txt frees up budget for other pages”: Partially. Google won’t reallocate freed budget unless you’re already hitting your serving limit. (Google docs)
  • “Crawl-delay in robots.txt slows Googlebot”: Google ignores crawl-delay entirely. Bing honors it.
  • “Smaller sites get crawled less”: Not true. Small high-quality sites get crawled normally.
  • “The URL parameters tool fixes parameter waste”: That tool was deprecated. Use canonicals or robots.txt.
  • “AI bots don’t affect Google crawl budget”: They do indirectly. Heavy AI bot load slows your server, signaling Googlebot to back off.
  • “llms.txt helps me get cited by ChatGPT”: John Mueller said it plainly: no AI system currently uses llms.txt.

Final Thoughts

The mental model is straightforward: Google wants to crawl less. AI bots want to crawl more. Your job is to make sure every crawl earns its place by being fast, useful, and clearly connected to the rest of your site.

Run the Crawl Waste Audit once. Set up the alerts so you don’t have to remember it. Re-snapshot periodically to verify the fixes held. The compounding effect on a WordPress site that previously had nobody watching the logs is significant.

Frequently Asked Questions About Crawl Budget

Get answers to the most common questions about crawl budget, AI bots, and WordPress crawl optimization

What is crawl budget in simple terms?

Crawl budget is the number of URLs a search engine like Google is willing to fetch from your site in a given time window. It’s the product of two factors Google watches simultaneously: your server’s capacity to handle requests (crawl capacity) and Google’s interest in your content (crawl demand). Importantly, crawl budget is allocated per hostname, not per domain, meaning subdomains like blog.yoursite.com and shop.yoursite.com get separate budgets.

Does crawl budget matter for small WordPress sites?

For most sites under 10,000 pages, no — Google’s John Mueller has stated directly that over 90% of sites don’t need to worry about crawl budget, and Martin Splitt called it “a problem that is rare to be had.” The exception is when you have severe technical waste regardless of size: faceted navigation explosion, broken redirects, soft 404s, or orphan ratios above 25%. The cleanup itself benefits every site since side effects like faster TTFB and fewer redirect chains improve user experience.

Is crawl budget a ranking factor?

No. Google’s documentation explicitly states crawling is not a ranking factor and that improving your crawl rate won’t necessarily lead to better positions in search results. However, pages that aren’t crawled can’t be indexed, so it acts as a prerequisite for visibility. Additionally, crawl and index are different things — plenty of crawled URLs never make it into the index, and that gap is usually a quality issue, not a budget issue.

How do I check my crawl budget?

Open Google Search Console → Settings → Crawl Stats to see your 90-day aggregate data. Divide your Total Crawl Requests by 90 to get your average daily crawl rate. For URL-level diagnosis, GSC falls short — you can’t filter by URL pattern or see AI bot activity. Use Linkilo’s Crawl Log Analyzer for granular diagnosis, including the 16-bucket Crawl Waste taxonomy and bot behavior across 40+ crawlers including Googlebot, GPTBot, ClaudeBot, and PerplexityBot.

Does noindex save crawl budget?

No. Google still has to crawl the page to see the noindex tag in the first place — the tag tells Google not to index the content, not to stop crawling it. To actually save crawl budget on URLs you don’t need crawled, use a robots.txt Disallow directive instead. Keep in mind that disallowing URLs only frees up budget for other pages if you’re already hitting your serving limit, so the gain isn’t automatic.

Do AI crawlers affect my crawl budget?

Yes, indirectly. AI bots like GPTBot and ClaudeBot consume real server resources — Cloudflare data shows AI bots at roughly 4.2% of all HTML requests, nearly matching Googlebot’s 4.5%. Combined AI bot traffic equals around 28% of Googlebot’s volume. When your TTFB rises because AI crawlers are hammering your server, Googlebot interprets that as “this site is slow” and adapts by crawling less. The two crawl economies are directly linked through server response time.

Should I block GPTBot and ClaudeBot?

Most publishers now block training bots (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Meta-ExternalAgent, PerplexityBot) and allow user-triggered search bots (OAI-SearchBot, ChatGPT-User, Claude-SearchBot, Perplexity-User). This preserves your visibility in AI answer engines while opting out of training datasets. The distinction matters: training bots crawl to feed datasets, while search bots fetch pages when users ask live questions. Blocking search bots makes you invisible in AI answer engines.

What’s the most common crawl budget mistake on WordPress?

Letting tag archives, search result pages, ?replytocom= URLs, and redirect-chained internal links accumulate without ever auditing them. Gary Illyes broke down where 75% of Google’s crawling problems come from: 50% faceted navigation, 25% action URLs, 10% irrelevant query parameters, and 5% buggy plugins generating phantom URLs. On a five-year-old WordPress site, it’s not unusual to find 200–400 internal links pointing at URLs that 301 redirect elsewhere — each one a wasted Googlebot request.

Is llms.txt worth implementing?

No. John Mueller has said plainly that no AI system currently uses llms.txt for ranking or citation. Treat it as optional. Your effort is better spent on the fundamentals: a clean robots.txt, fast TTFB, proper canonical tags, fixing redirect chains, and ensuring your content is server-rendered since none of the major AI crawlers — GPTBot, ClaudeBot, PerplexityBot, AppleBot — render JavaScript. If your content depends on client-side rendering, you’re invisible to most AI search engines.

What are soft 404s and why do they waste crawl budget?

Soft 404s are pages that return a 200 OK status but contain no real content — empty category archives, expired event listings, stale tag pages with three thin posts, or WooCommerce categories with no products. As Gary Illyes put it, with soft 404s you aren’t adding anything to the index and you’re wasting crawl budget. On WordPress, common culprits include empty tag archives, author archives on single-author sites, date archives, and search result pages. Disable, redirect, or noindex these depending on the template.

How do orphan pages affect crawl budget?

Orphan pages — pages with zero inbound internal links — quietly consume crawl resources. Botify’s analysis of enterprise sites found orphan pages consume an average of 26% of crawl budget, and on unoptimized sites they’ve seen orphan ratios above 70%. JetOctopus tracks the related issue of click depth: pages more than four clicks from the homepage get crawled in 50% of cases or less. Mueller has been clear that how quickly Google can reach a page from the homepage signals how critical Google thinks it is.

Does Google honor the crawl-delay directive in robots.txt?

No, Google ignores crawl-delay entirely. Bing honors it, but Google does not. The old Search Console tool that let you throttle Googlebot has also been retired. If you genuinely need to slow Googlebot now, your only options are returning 500, 503, or 429 status codes to signal server stress, or filing a special request through the Googlebot Report form. For most sites, the better approach is fixing the underlying performance issues so Googlebot’s adaptive crawl rate naturally adjusts to a healthy level.

How does HTTP caching improve crawl budget?

When Googlebot requests a page it has visited before, it can send an If-Modified-Since header. If your server responds with 304 Not Modified, Googlebot doesn’t re-download the content — saving massive crawl-time. The problem: Google’s data shows only 0.017% of total fetches are cacheable today, down from 0.026% a decade earlier. Site owners are getting worse at this. Implement proper Last-Modified and ETag headers, use Cloudflare APO (which can drop TTFB by 70%+), or configure cache-control through WP Rocket or your managed host.

What TTFB should I aim for to maximize crawl budget?

John Mueller’s stated TTFB target is around 300–400ms. Above that, Googlebot’s adaptive crawl rate kicks in and crawls less frequently. The challenge is that AI bot traffic now competes with Googlebot for server capacity — when GPTBot or ClaudeBot is hammering your archive pages overnight and pushing TTFB up, Googlebot interprets that as a signal to back off. Verify your actual Googlebot-experienced response time in your crawl logs rather than relying on synthetic benchmarks, which often miss the real-world load pattern.

Why are internal links pointing at redirects so harmful?

Every internal link pointing at a redirected URL forces Googlebot to make two requests for one page — fetch the original URL, follow the 301, then fetch the destination. Multiply that by hundreds of links on an older site and you’ve quietly tanked your crawl efficiency. SearchPilot’s case study work found that pointing internal links directly to the final 200-status URL — rather than letting them redirect — measurably improved crawl efficiency. This is often the single highest-impact fix in a crawl waste audit, especially on sites with several years of accumulated redirect debt.