Web Crawler and Bot Reference Guide

Understanding which bots visit your site is key to interpreting your crawl-log data. These automated programs serve very different purposes: search engine indexing, AI training, social media previews, and SEO research.

2025 reality check: AI crawlers reportedly make up over 50% of traffic on some sites. Some collect data for AI training (you may want to block them); others fetch content when users ask AI assistants questions (these can drive referral traffic to your site).

Search engine crawlers

These index your pages so they appear in search results. Blocking them removes you from that search engine.

Googlebot

  • Operator: Google
  • Detection pattern: googlebot, google-structured-data-testing-tool
  • Purpose: Google's primary crawler. Operates in Desktop and Smartphone versions for mobile-first indexing.
  • Why track: Directly impacts your Google Search visibility. A higher crawl frequency generally suggests that Google treats your content as important or frequently updated.
  • Recommendation: Always Allow

Bingbot

  • Operator: Microsoft
  • Detection pattern: bingbot, msnbot, bingpreview
  • Purpose: Crawls for Bing search. Also powers Copilot and Microsoft Edge suggestions.
  • Why track: Essential for visibility in Bing, Edge, Cortana, and Office integrations.
  • Recommendation: Always Allow

DuckDuckBot

  • Operator: DuckDuckGo
  • Detection pattern: duckduckbot, duckduckgo
  • Purpose: Crawls for DuckDuckGo's privacy-focused search.
  • Recommendation: Recommended

YandexBot

  • Operator: Yandex
  • Detection pattern: yandexbot, yandex
  • Purpose: Crawler for Yandex, Russia's leading search engine.
  • Recommendation: Regional (allow if targeting Russian-speaking markets)

Baiduspider

  • Operator: Baidu
  • Detection pattern: baiduspider, baidu
  • Purpose: Crawler for Baidu, China's dominant search engine.
  • Recommendation: Regional (allow if targeting China)

Yahoo Slurp

  • Operator: Yahoo
  • Detection pattern: slurp, yahoo
  • Purpose: Historically Yahoo's primary crawler; Yahoo now largely uses Bing's index.
  • Recommendation: Optional

Applebot

  • Operator: Apple
  • Detection pattern: applebot
  • Purpose: Powers Siri suggestions, Spotlight, Safari intelligent features.
  • Recommendation: Recommended

Sogou Spider

  • Operator: Sogou (Tencent)
  • Detection pattern: sogou
  • Purpose: Chinese search engine alternative to Baidu, integrated with WeChat.
  • Recommendation: Regional (China)

PetalBot

  • Operator: Huawei
  • Detection pattern: petalbot
  • Purpose: Powers Petal Search on Huawei devices without Google services.
  • Recommendation: Regional
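
The detection patterns above are lowercase substrings, so one way to apply them is a case-insensitive substring check against the User-Agent header. The following is a minimal sketch under that assumption; the dictionary layout and the function name `identify_search_bot` are hypothetical and are not taken from any particular tool.

```python
# Hypothetical mapping: bot name -> lowercase substrings copied from the
# "Detection pattern" entries in the search engine section above.
SEARCH_ENGINE_PATTERNS = {
    "Googlebot": ["googlebot", "google-structured-data-testing-tool"],
    "Bingbot": ["bingbot", "msnbot", "bingpreview"],
    "DuckDuckBot": ["duckduckbot", "duckduckgo"],
    "YandexBot": ["yandexbot", "yandex"],
    "Baiduspider": ["baiduspider", "baidu"],
    "Yahoo Slurp": ["slurp", "yahoo"],
    "Applebot": ["applebot"],
    "Sogou Spider": ["sogou"],
    "PetalBot": ["petalbot"],
}


def identify_search_bot(user_agent):
    """Return the first bot whose pattern occurs in the User-Agent, else None."""
    ua = user_agent.lower()
    for bot, patterns in SEARCH_ENGINE_PATTERNS.items():
        if any(pattern in ua for pattern in patterns):
            return bot
    return None
```

A User-Agent header can be spoofed, so a match here only says what a request claims to be.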

AI training crawlers

These collect content to train AI/LLM models. Blocking them will NOT affect your search rankings.

Safe to block: you can block every AI training bot without any impact on Google, Bing, or other search engine visibility.

GPTBot

  • Operator: OpenAI
  • Detection pattern: gptbot
  • Purpose: Collects training data for GPT models (ChatGPT, GPT-4). Highest-volume AI crawler, at roughly 30% of all AI bot traffic.
  • Recommendation: Your Choice (safe to block)

Google-Extended

  • Operator: Google
  • Detection pattern: google-extended
  • Purpose: Training data for Google's Gemini. Completely separate from Googlebot.
  • Recommendation: Your Choice (safe to block — does NOT affect Google Search)

ClaudeBot

  • Operator: Anthropic
  • Detection pattern: claudebot
  • Purpose: Training data for Claude AI. Respects robots.txt.
  • Recommendation: Your Choice (safe to block)

Meta-ExternalAgent

  • Operator: Meta
  • Detection pattern: meta-externalagent
  • Purpose: Training data for Llama models. ~19% of AI crawler traffic. Different from Facebook's social preview bot.
  • Recommendation: Your Choice (safe to block)

PerplexityBot

  • Operator: Perplexity AI
  • Detection pattern: perplexitybot
  • Purpose: Crawls pages for Perplexity's AI answer engine. Fastest-growing AI crawler, with 157,000%+ growth in 2024.
  • Recommendation: Your Choice (safe to block)

CCBot

  • Operator: Common Crawl Foundation
  • Detection pattern: ccbot
  • Purpose: Non-profit. Data ends up in many AI systems indirectly.
  • Recommendation: Your Choice (safe to block)

Bytespider

  • Operator: ByteDance (TikTok)
  • Detection pattern: bytespider
  • Purpose: ByteDance's AI crawler. Currently declining in volume.
  • Recommendation: Your Choice (safe to block)

Grok

  • Operator: xAI
  • Detection pattern: grok
  • Purpose: Powers Grok with real-time web access; integrated with X (Twitter).
  • Recommendation: Your Choice (safe to block)

Cohere-AI

  • Operator: Cohere
  • Detection pattern: cohere-ai
  • Purpose: Enterprise AI training data.
  • Recommendation: Your Choice (safe to block)

AI2Bot

  • Operator: Allen Institute for AI
  • Detection pattern: ai2bot
  • Purpose: Academic AI research data.
  • Recommendation: Your Choice (safe to block)

Diffbot

  • Operator: Diffbot
  • Detection pattern: diffbot
  • Purpose: AI-powered web scraping and knowledge graph building.
  • Recommendation: Your Choice (safe to block)

Applebot-Extended

  • Operator: Apple
  • Detection pattern: applebot-extended
  • Purpose: Apple Intelligence training. Separate from regular Applebot.
  • Recommendation: Your Choice (safe to block — does NOT affect Siri/Spotlight)

Amazonbot

  • Operator: Amazon
  • Detection pattern: amazonbot
  • Purpose: Powers Alexa answers and collects AI training data.
  • Recommendation: Your Choice (blocking may affect how Alexa answers about your content)

YouBot

  • Operator: You.com
  • Detection pattern: youbot
  • Recommendation: Your Choice (safe to block)

Omgilibot

  • Operator: Omgili / Webz.io
  • Detection pattern: omgili
  • Purpose: Powers news and social media monitoring.
  • Recommendation: Your Choice (safe to block)
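
Some detection patterns in this guide overlap: `applebot` is a substring of `applebot-extended`, so a naive substring check could file an AI training hit under the search crawler. One possible way around this, sketched below, is to test longer patterns first. The table is a hypothetical subset of the patterns listed above, and the category labels and the name `classify` are assumptions made for illustration.

```python
# Hypothetical subset of this guide's patterns: (pattern, bot, category).
PATTERNS = [
    ("applebot", "Applebot", "search"),
    ("applebot-extended", "Applebot-Extended", "ai_training"),
    ("gptbot", "GPTBot", "ai_training"),
    ("claudebot", "ClaudeBot", "ai_training"),
    ("ccbot", "CCBot", "ai_training"),
]

# Check longer patterns first so "applebot-extended" wins over "applebot".
_BY_LENGTH = sorted(PATTERNS, key=lambda row: len(row[0]), reverse=True)


def classify(user_agent):
    """Return (bot, category) for the most specific matching pattern, else None."""
    ua = user_agent.lower()
    for pattern, bot, category in _BY_LENGTH:
        if pattern in ua:
            return bot, category
    return None
```

The User-Agent strings you pass in are whatever your logs contain; the ones used to try this out can be any illustrative strings that include the bot token.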

AI assistant crawlers (user-triggered)

These fetch pages when a real user asks an AI assistant a question. They can drive referral traffic when AI cites your content.

Consider allowing these. Unlike training crawlers, these represent actual users seeking information. When AI cites your content, those users may click through to your site.

ChatGPT-User

  • Operator: OpenAI
  • Detection pattern: chatgpt-user
  • Purpose: Real-time fetch when ChatGPT users ask questions. 2,825%+ growth in 2024.
  • Recommendation: Consider Allowing (can drive traffic)

OAI-SearchBot

  • Operator: OpenAI
  • Detection pattern: oai-searchbot
  • Purpose: Powers ChatGPT's search feature.
  • Recommendation: Consider Allowing

Claude-User

  • Operator: Anthropic
  • Detection pattern: claude-user
  • Purpose: Fetches content when Claude users request current information.
  • Recommendation: Consider Allowing

Perplexity-User

  • Operator: Perplexity AI
  • Detection pattern: perplexity-user
  • Purpose: Powers Perplexity's AI search answers. Note: known to sometimes ignore robots.txt.
  • Recommendation: Consider Allowing (high referral potential)
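
To see how training crawlers and user-triggered assistants compare on your own site, you can tally log hits by category. The sketch below assumes the common "combined" access-log format, in which the User-Agent is the last double-quoted field on each line; the pattern table is a hypothetical subset of the two AI sections above, and `tally_ai_hits` is an illustrative name.

```python
import re
from collections import Counter

# Assumption: combined log format, User-Agent is the last quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')

# Hypothetical subset of detection patterns from the two AI sections.
CATEGORY_BY_PATTERN = {
    "gptbot": "ai_training",
    "claudebot": "ai_training",
    "ccbot": "ai_training",
    "chatgpt-user": "ai_assistant",
    "oai-searchbot": "ai_assistant",
    "claude-user": "ai_assistant",
    "perplexity-user": "ai_assistant",
}


def tally_ai_hits(log_lines):
    """Count log lines per AI category; lines with no match are ignored."""
    counts = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue
        ua = match.group(1).lower()
        for pattern, category in CATEGORY_BY_PATTERN.items():
            if pattern in ua:
                counts[category] += 1
                break
    return counts
```

If your server writes a different log format, only `UA_FIELD` should need to change.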

Social media crawlers

These generate link previews when your URLs are shared. Blocking them breaks your link previews.

facebookexternalhit

  • Operator: Meta
  • Detection pattern: facebookexternalhit, facebot
  • Purpose: Link previews for Facebook, Instagram, Messenger, WhatsApp.
  • Recommendation: Always Allow

Twitterbot

  • Operator: X (Twitter)
  • Detection pattern: twitterbot
  • Recommendation: Always Allow

LinkedInBot

  • Operator: LinkedIn
  • Detection pattern: linkedinbot
  • Recommendation: Always Allow

Pinterest

  • Operator: Pinterest
  • Detection pattern: pinterest, pinterestbot
  • Purpose: Indexes images for Pinterest's visual discovery.
  • Recommendation: Always Allow

Slackbot

  • Operator: Slack (Salesforce)
  • Detection pattern: slackbot
  • Recommendation: Recommended

Discordbot

  • Operator: Discord
  • Detection pattern: discordbot
  • Recommendation: Recommended

TelegramBot

  • Operator: Telegram
  • Detection pattern: telegrambot
  • Recommendation: Recommended

WhatsApp

  • Operator: Meta
  • Detection pattern: whatsapp
  • Recommendation: Always Allow

SEO tool crawlers

Commercial bots gathering data for SEO analysis tools. Generally safe to allow but can be resource-intensive.

AhrefsBot

  • Operator: Ahrefs
  • Detection pattern: ahrefsbot, ahrefs
  • Purpose: Backlink database, keyword research, site audits.
  • Recommendation: Allow (consider rate-limiting if needed)

SemrushBot

  • Operator: Semrush
  • Detection pattern: semrushbot, semrush
  • Recommendation: Allow (consider rate-limiting if needed)

MJ12bot

  • Operator: Majestic
  • Detection pattern: mj12bot, majestic
  • Purpose: Trust Flow, Citation Flow, link authority metrics.
  • Recommendation: Allow

DotBot

  • Operator: Moz
  • Detection pattern: dotbot, moz, opensiteexplorer
  • Purpose: Calculates Domain Authority and Page Authority.
  • Recommendation: Allow

Rogerbot

  • Operator: Moz
  • Detection pattern: rogerbot
  • Recommendation: Allow

Screaming Frog

  • Operator: Screaming Frog Ltd
  • Detection pattern: screaming frog
  • Purpose: Popular desktop SEO crawler — often used by SEO pros (yours or competitors').
  • Recommendation: Allow

SEOkicks

  • Operator: SEOkicks
  • Detection pattern: seokicks
  • Recommendation: Allow

SISTRIX

  • Operator: SISTRIX
  • Detection pattern: sistrix
  • Purpose: Powers SISTRIX visibility index — popular in European markets.
  • Recommendation: Allow

SerpstatBot

  • Operator: Serpstat
  • Detection pattern: serpstatbot
  • Recommendation: Allow
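
Where an SEO crawler becomes resource-intensive, rate-limiting is a middle ground between allowing and blocking. The following is a minimal sliding-window sketch, not a production limiter: the class name `BotRateLimiter`, its parameters, and the in-memory storage are all assumptions made for illustration.

```python
import time
from collections import defaultdict, deque


class BotRateLimiter:
    """Allow at most max_requests per bot within a sliding window (seconds)."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # bot name -> recent hit timestamps

    def allow(self, bot):
        now = self.clock()
        hits = self._hits[bot]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A caller would identify the bot from the User-Agent first, then respond with HTTP 429 whenever `allow()` returns False.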

Quick robots.txt reference

Block the major AI training bots

User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: PerplexityBot
Disallow: /
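
Before deploying rules like these, it can help to check that they do what you expect. The sketch below uses Python's standard `urllib.robotparser` module to test a robots.txt string; the helper name `is_allowed` and the example.com URL are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# The blocking rules from the snippet above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: PerplexityBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/page")` is False, while the same call for `"Googlebot"` is True, because no group names it.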

Allow AI assistant bots (for referral traffic)

User-agent: ChatGPT-User
Allow: /
User-agent: Claude-User
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
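
If you maintain both lists, generating the file from a single policy table keeps the block and allow rules in sync. This is a minimal sketch; `build_robots_txt` and the shape of the policy dictionary are hypothetical.

```python
def build_robots_txt(policy):
    """Render {user_agent: allowed} as robots.txt groups, one per bot."""
    groups = []
    for agent, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {agent}\n{rule}")
    # A blank line between groups is conventional and easier to read.
    return "\n\n".join(groups) + "\n"
```

For example, `build_robots_txt({"GPTBot": False, "ChatGPT-User": True})` produces a Disallow group for GPTBot followed by an Allow group for ChatGPT-User.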

Not all bots respect robots.txt. For stronger control, use server-level blocking via .htaccess rules or a Web Application Firewall (WAF).
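
The same idea can also be applied at the application layer. The sketch below is neither an .htaccess rule nor a WAF; it is a small WSGI middleware, for Python web apps, that answers 403 to any User-Agent containing a blocked pattern. The name `block_bots` and the pattern list are assumptions chosen for illustration.

```python
# Hypothetical choice of patterns to block; adjust to your own policy.
BLOCKED_PATTERNS = ("gptbot", "ccbot", "bytespider")


def block_bots(app, blocked=BLOCKED_PATTERNS):
    """Wrap a WSGI app and answer 403 to User-Agents matching a blocked pattern."""

    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(pattern in ua for pattern in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Not a blocked bot: pass the request through unchanged.
        return app(environ, start_response)

    return middleware
```

Because this relies on the User-Agent header, it stops only bots that identify themselves honestly.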


This reference is part of Linkilo's Crawler Analyzer feature.


© Copyright 2024, All Rights Reserved